3. Internet Radio 



Broadcast radio hasn't changed much over the last few decades— it's always been a one-way medium. The 
program manager (and occasionally the DJ) determines the type of programming, including which songs are 
played and how often they are played. For the listener, it's a "take it or leave it" proposition. If you like what a 
station is playing, you listen to it; if you don't like it, then you either have to suffer through it or turn the dial until 
you find something better. 

Because traditional radio is not an interactive medium, there is little feedback from listeners. In theory, listeners 
ultimately determine what music a station plays and how often. But listener feedback is indirect and slow through 
the existing rating systems. And the ratings systems are driven more by business considerations than by the 
preferences of individual listeners. 

A common complaint about broadcast radio is that stations do little to help listeners identify songs. (How often 
have you heard a song on the radio and wanted to know the name of the song or artist, but the DJ never 
announced either one?) Other complaints include excessive amounts of commercials and a limited number of 
local stations to choose from. 

Internet radio eliminates many of the shortcomings of broadcast radio because it's delivered through the 
Web— an inherently interactive medium. Internet radio offers dozens of stations per site and allows 
interactive feedback so each listener can directly influence programming. 

Internet radio also gives you access to a far wider variety of stations and programming than traditional 
broadcast radio. Radio sites on the Web can have dozens of stations featuring uninterrupted music, comedy, 
sports and talk shows, news, special events and many other types of programming. 

Internet radio isn't limited by geography like broadcast radio is. In fact, Internet radio is often used to extend 
the reach of regular broadcast stations. If you're traveling out of the broadcast area of your favorite home 
station, you may still be able to listen to it if they also transmit their programming over the Internet. 

Another advantage of Internet radio over broadcast radio is its availability in buildings where radio reception 
is poor or regular radio isn't an option. As long as you have an Internet connection, you can tune in and listen
anytime. 

Most Internet stations display the name of the song and the artist the entire time the song is playing, and many 
stations can also display album graphics, credits, and lyrics, along with links to the artist's Web site. If you 
hear a song you like, many stations provide a link so you can purchase the song or album on the spot. 

Some sites, like Imagine Radio, even allow you to set up a personal radio station, which you customize by 
selecting the artists and the types of music you want to hear. Once your radio station is set up, you can tune in 
and listen to music customized to your tastes. You can also make your station available to other listeners. 

Major players like America Online and Rolling Stone Magazine are getting involved in Internet radio. AOL's 
Spinner.com Web site offers over 100 stations and a selection of more than 150,000 songs. Rolling Stone 
Radio features stations that play music selected by rock stars, such as David Bowie, and other celebrities. 
Many of these larger sites also offer music charts, industry news and other types of music-related content. 

Many broadcast radio stations now have Web sites, and a few are beginning to offer their regular 
programming via the Internet. Sites such as Broadcast.com act as aggregators (collectors and distributors) of
streaming media programming and Web content for both traditional radio stations and Internet-only radio
stations.

Internet radio sites can generate advertising revenue with both announcement-type ads and banner ads. In this 
respect, the economic model of Internet radio is similar to broadcast radio. But Internet radio sites can expand 
on this model to earn commissions on products sold through their sites. Some sites even offer premium
subscription services, similar to cable and satellite TV.

Most of the larger Internet radio sites and services feature banner ads, but at least most of them avoid 
promotional announcements that interrupt the music. Some services, like vTuner, offer commercial-free 
listening if you purchase their "plus" software. Only a handful of the larger sites, such as Green Witch, 
remain commercial-free. 

Due to the requirement for an Internet connection, Internet radio isn't yet as portable as broadcast radio. But 
within the next few years, hand-held PCs will be able to double as portable radios. (Hand-held PCs already 
offer wireless Internet access, and several models, such as the Cassiopeia E-100, include sound capability and
software to play MP3s.) 



Sound Quality 

The main factor limiting Internet radio is bandwidth. Dual channel ISDN (128 kbps) is the minimum needed 
for high-quality stereo music, but the majority of users have much slower connections. Voice quality is 
usually fine at slower connection speeds, but music quality is barely acceptable with connections slower than 
56 kbps. 

Other fast Internet connections such as DSL, cable modems, and satellite links provide enough bandwidth for 
CD-quality audio. However, even with unlimited bandwidth, network congestion can cause problems during 
peak usage periods. These problems will eventually be solved, but it could be years before the majority of 
Internet users have access to fast connections. 

Another problem, even bigger than the connection speed of individual users, is that most streaming audio (and 
video) on the Internet is transmitted in a unicast mode— which is extremely inefficient. With unicast, each 
listener (or viewer) receives a separate stream. A station that has 500 users connected will send 500 copies of
the same stream. 

Even if all users had fast Internet connections, the Internet currently could handle only a few million 
simultaneous listeners with unicast transmissions. There is nowhere near enough server capacity and 
bandwidth to support tens of millions of listeners or viewers like network radio and network television can. 

Eventually, the Internet will become multicast enabled, and a single stream will be able to be shared by 
multiple users. Only then will Internet radio be able to compete on the scale of traditional broadcast media. 
By the time that point is reached, the traditional radio and television networks will have had a chance to 
transition much of their programming to Internet. 
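
The difference between unicast and multicast described above comes down to simple arithmetic. The sketch below is only an illustration: it assumes a 128 kbps stream (the dual-channel ISDN rate mentioned earlier), and the function names are hypothetical.

```python
def unicast_server_kbps(listeners, stream_kbps):
    """Outbound bandwidth needed when every listener receives a separate copy."""
    return listeners * stream_kbps


def multicast_server_kbps(listeners, stream_kbps):
    """With multicast the server sends a single stream, whatever the audience size."""
    return stream_kbps if listeners > 0 else 0


if __name__ == "__main__":
    listeners, rate = 500, 128  # 500 listeners, assumed 128 kbps per stream
    print(unicast_server_kbps(listeners, rate))    # 64000 kbps, i.e. 64 Mbps
    print(multicast_server_kbps(listeners, rate))  # 128 kbps
```

At these assumed rates, a station with 500 unicast listeners needs roughly 64 Mbps of outbound capacity, while a multicast station would need only the 128 kbps of a single stream.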



Digital Radio 

Digital audio technologies like MP3 are a key part of Internet radio because they help squeeze more sound 
through slower Internet connections. But digital technology can also be used in other forms of radio to
improve the sound quality and transmit related information along with the music. Broadcast networks already
use MP2 (similar to MP3) to transmit audio signals to their affiliate stations. 

Eventually, all forms of radio will go digital. Cable and satellite TV systems already offer multiple music 
channels and have the capability to display text and video with music. Some systems, such as DMX, already 
offer digital transmission over cable and may eventually offer true interactivity. 

Traditional broadcast radio stations can use digital transmission to offer improved sound quality and display 
song titles and artist names along with the audio. Portable digital radios featuring a small display to show text 
and graphics will become commonplace within the next few years, and portable satellite and cellular radio 
services that allow a station to "follow" you as you drive across the continent will also become available. 

Internet radio is available now, and, thanks to the high degree of interactivity provided by the Web, it opens 
the door to a whole new world that broadcast, cable and satellite radio can't. Eventually cable and satellite 
services will offer interactivity or even merge with Web TV and Internet radio. Until these media converge, 
consumers will be faced with a bewildering array of delivery mechanisms for audio and video content.

Listening to Internet Radio

Figure 4 - Media Overload (diagram labels: Interactive: the Internet, other networks; Physical Media: CD, DVD, tapes, hard disks, flash memory)

To listen to Internet radio, you need software that can play streaming audio. Streaming audio is a subset of
streaming media (audio, video and text, etc.) and comes in several formats, so you may need to install more 
than one program. 

Many players, such as the RealPlayer G2 and the Windows Media Player, support multiple formats, including
streaming MP3. The fact that there are multiple formats and players for streaming audio can be confusing. 
Fortunately, most sites include links for you to download any software required to listen to them. 

At the very least, you should install the latest versions of the RealPlayer, Windows Media Player and at least 
one of the full-featured MP3 players, such as Sonique or Winamp. These programs will allow you to listen to 
the majority of Internet radio sites. 

Most players can be downloaded for free, although there are a few, such as the Plus version of the RealPlayer, 
that you must purchase. Fortunately, the free version of the RealPlayer offers everything most users need. The 
Windows Media Player is included with Windows. Updates can be obtained for free directly from Microsoft. 
Sonique and Winamp are both freeware. 

Many of the larger radio sites, such as Spinner.com and Rolling Stone Radio, require you to install their own
"tuners." Both of these sites use RealAudio, and you'll need to install the RealPlayer along with their own
software. Many of these larger sites also require you to register before you can download their software, and
some require you to log in each time you listen.

vTuner

vTuner provides an easy way to find and listen to thousands of stations (radio, television, Webcam, and others)
from all over the world. The free version of vTuner categorizes stations by type and geographic location and
provides browsing and searching capabilities. vTuner Plus ($29.99) adds station scanning, playback scheduling,
station ratings based on quality and reliability, and replaceable "skins." The Plus version also lets you avoid
listening to advertisements.

vTuner Main Screen

Table 2 lists some of the more popular streaming media player software that can be used to listen to Internet
radio.



Table 1 - Streaming Media Players

Player                      Streaming Formats          Web Site
QuickTime                   QuickTime, MP3 and others
RealPlayer                  RealAudio, MP3 and others
Rolling Stone Radio Player  RealAudio                  www.rsradio.com
(name illegible)            RealAudio
vTuner                      RealAudio                  www.vtuner.com
Winamp                      MP3 and others             www.winamp.com
Winplay                     Encrypted MP3
Windows Media Player        WMA, MP3 and others





Popular Radio Web Sites 

A sampling of popular Internet radio sites follows. Many more sites are available but aren't covered here due to 
space limitations. (For more listings, see the Internet Radio section of Appendix A, Interesting Web Sites.) 

Green Witch 

Green Witch (www.greenwitch.com) is an easy-to-use site that offers a wide range of commercial-free music on
multiple channels, including genres such as alternative rock, blues, classical, hip-hop and more. 

Green Witch helps link independent artists to fans. Artists can have their songs added to Green Witch's streams 
with features to help them sell music, including an artist information page, a "Buy" button that links to retail 
fulfillment and a "Download" button for on-demand retail. 

Green Witch also provides links to dozens of independent Icecast stations, whose offerings range from various
genres of music to comedy and talk shows, including Rush Limbaugh and animal noises. (See if you
can tell the difference between the latter two.)

The channels offered by Green Witch use streaming MP3. To listen to a channel, click on the speaker icon to the 
left of the channel name. 




playing, click on the Pause button above the song title. You can't go
back to the previous song because of current webcasting laws.

To create your own customized radio station, you first select a name, a scene, and the music genres (Blues,
Jazz, etc.) to include. Then you browse a list of artists and rank them depending on how frequently you
want their songs to be played. To listen to your station, select the play my station button from the main page.

You can rate any song by clicking on the Edit button while the song is playing. This influences how often the
song plays on your station in the future. Ratings go from 0 to 5 and IR. If you hate the song and never want to
hear it played on your station, select 0. If you love the song and want to hear it frequently, select 5. To have the
Imagine Radio DJ decide how often the song is played, select IR.
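
One plausible way for a rating to influence how often a song plays is weighted random selection. The sketch below is purely illustrative: the source does not describe Imagine Radio's scheduling method, and the function names and the default weight used for "IR" are assumptions.

```python
import random

DJ_DEFAULT_WEIGHT = 3  # hypothetical weight applied when the listener selects "IR"


def play_weight(rating):
    """Map a rating (0-5 or "IR") to a selection weight; 0 means never play."""
    if rating == "IR":
        return DJ_DEFAULT_WEIGHT
    if rating not in range(6):
        raise ValueError(f"rating must be 0-5 or 'IR', got {rating!r}")
    return rating


def pick_next_song(ratings, rng=random):
    """Pick a song title at random, in proportion to its weight."""
    playable = {t: play_weight(r) for t, r in ratings.items() if play_weight(r) > 0}
    if not playable:
        raise ValueError("no playable songs")
    titles = list(playable)
    return rng.choices(titles, weights=[playable[t] for t in titles], k=1)[0]
```

Under this scheme a song rated 0 is never chosen, and a song rated 5 is chosen about five times as often as a song rated 1.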

You can make your custom station available to others. You can also listen to radio stations that other listeners
have put together, grouped in such themes as Carnival, French Quarter and Woodstock.

To listen to Imagine Radio, you need either the RealPlayer G2 or the Windows Media Player.

RadioMoi

RadioMoi offers an array of music, comedy and celebrity interviews.

RadioMoi was the first webcaster to sign an agreement with the RIAA and to be licensed under the Digital
Millennium Copyright Act. This license allows RadioMoi to stream copyrighted sound recordings and requires
it to make royalty payments. RadioMoi also provides links to artist and record label Web sites and lets
listeners purchase albums on the spot.

RadioMoi uses an encrypted form of MP3. To listen to audio, you must use Winplay (the free RadioMoi player).
If you have other MP3 players, such as Winamp, installed, you may need to change the application associated with
the M3U file type. Otherwise, your MP3 player may attempt to play the stream and fail
because of the encryption. (See Chapter 9, Organizing and Playing Music, for more information on file type
associations.)

Rolling Stone Radio 

Rolling Stone Radio (www.rsradio.com) is Rolling Stone Magazine's Internet radio site. It offers a diverse
range of channels, one of which is the David Bowie Radio Network, offering the rock star's
favorite music.

Rolling Stone Radio's channels are located on the left side of the player. Only some of them show at once, so you
need to use the up and down arrow keys on the player to scroll through them. Once you find a channel you want to
listen to, click the Play button. After several seconds, the song will begin to play, and the title and artist name will
be displayed. Click on the artist name to see more information about that artist.

You can rate any song while it's playing by clicking on one of the labeled checkboxes.
You can click on a link that will take you to Amazon.com, where you can purchase the album. You can also click
on a button to submit a song request to the Rolling Stone Radio DJ.

To play Rolling Stone Radio, you need the RealPlayer G2 and the Rolling Stone Radio tuner, both of which can 
be downloaded at the rsradio.com site for free. 

Rolling Stone Radio Tuner

Spinner

Spinner (www.spinner.com) is owned by America Online and offers access to more than 150,000 songs across
100+ music channels, grouped by genre, with programmable presets. You can rate any song that's played, access 
artist information, and, if you want, purchase the CD. Currently, Spinner doesn't allow you to set up your own 
station. 

To listen to Spinner on a Windows system, you need to install their stand-alone player. If you're on a Mac or
Unix system, Spinner offers a player that runs in conjunction with your Web browser. You'll also need to install
the RealPlayer.

Streaming MP3 

Streaming MP3 has rapidly become the choice for amateur webcasters worldwide. Now, anyone with a PC and an
Internet connection can inexpensively stream music to listeners throughout the world, using SHOUTcast or 
Icecast streaming MP3 software. 

Spinner Plus Tuner

SHOUTcast

Nullsoft's SHOUTcast (www.shoutcast.com) provides users with a simple way to stream audio to listeners
all over the world. SHOUTcast servers can submit their description and status back to the main SHOUTcast
server directory, which allows listeners to locate SHOUTcast servers without knowing their IP addresses.
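
The directory idea described above can be sketched as a small in-memory registry: each station reports its description and status, and listeners look stations up by genre instead of by address. This is a toy model only, not SHOUTcast's actual protocol; all class, method, and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StationEntry:
    address: str        # host:port that a listener's player connects to
    description: str
    genre: str
    listeners: int = 0  # current status, as reported by the station


class StationDirectory:
    """Toy in-memory directory; a real directory would be a network service."""

    def __init__(self):
        self._stations = {}

    def report(self, name, entry):
        """Called by a station server to publish or refresh its entry."""
        self._stations[name] = entry

    def find_by_genre(self, genre):
        """Return (name, address) pairs for stations in a genre, sorted by name."""
        return sorted(
            (name, e.address)
            for name, e in self._stations.items()
            if e.genre.lower() == genre.lower()
        )
```

A listener who asks for "blues" gets back station names and addresses without ever having to know an IP address in advance.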

Icecast

Icecast (www.icecast.org) is an open source streaming MP3 server, similar to SHOUTcast. It is available for free,
including the source code. Because it is open source, useful modifications and additions by users are incorporated 
back into the main code for the benefit of all users. 

MP3Spy

MP3Spy helps you find SHOUTcast servers and listen to webcasters from all over the world.
MP3Spy lists the available SHOUTcast servers and identifies them by music genre and type of programming.
When you choose a server, MP3Spy connects you to the server's audio stream and the Web page of the station.
When you connect to a SHOUTcast server, you can also chat with the DJ or other users who are listening to the
same music. You'll need an MP3 player, such as Winamp or Sonique, to use MP3Spy.

Webcasting Licensing 

Internet radio stations can give listeners a high degree of control over the music they hear, but the music 
industry seems to fear that giving listeners too much control will reduce music sales. Their reasoning seems to 
be that if people could choose to listen to any song at any time, there would be little incentive for anyone to 
actually purchase music. 

While the recording industry was slow to recognize the potential of downloadable music, it was quicker to 
recognize the potential (and threat) of Internet radio and lobbied to have laws enacted to protect its interests. 
The Digital Millennium Copyright Act (sponsored by the recording industry) addresses the issue of 
webcasting by providing statutory (mandated by law) licenses for webcasters who meet certain conditions.
(See Chapter 5, Digital Music and Copyright Law, for the requirements for statutory webcasting licenses.)

Some Internet radio sites exist that webcast music illegally, but many webcasters want to be "legal" and are 
obtaining or have obtained the licensing required. Amateur webcasters are popularizing streaming audio, just
like grass roots support and the Internet popularized MP3. But the recording industry is bent on ensuring that
proper royalties are paid whenever copyrighted music is played and that music streamed over the Internet
doesn't cut into music sales.

In addition to licensing fees, webcasters are subject to several significant restrictions. For example, while 
Internet radio listeners can select the songs they want to hear, it is illegal for webcasters to allow them to 
select a particular song to play instantly, unless the song has been specifically authorized for interactive 
distribution. Even though listeners can create personalized stations, the site's DJ must rotate the playlists and
determine when each song is played. 

Webcasters are concerned that these types of restrictions will inhibit their ability to play the music that listeners 
want to hear, and make it financially unfeasible to operate a radio site. Internet radio is evolving rapidly and more 
legislation may be required as it matures. Eventually, more standards and laws will be established and Internet
radio will become a major component of our media, just like broadcast radio and television. Until then, it's still a
bit like the Wild Wild West.

A Content-Aware Sound Browser

ABSTRACT: The SoundFisher™ browser is a cross-platform, single- or multi-user sound-effects database
application. It incorporates an audio analysis engine that permits retrieval of sounds based on their acoustical 
similarity, as well as on traditional keywords and file attributes. Databases can be constructed from sounds on a 
local filesystem and/or the World Wide Web. 



1. Introduction 

When studios amass collections of sound 
effects and other sound files totaling hundreds of 
gigabytes, intuitive means of organizing and 
retrieving sounds become imperative. Traditional 
approaches have required time-consuming manual
classification and organization. Describing the 
sound with keywords, while important, is not 
enough. What is needed is a system that can 
automatically compare, classify, and retrieve 
sounds. A number of researchers have investigated 
the problem of how to automatically classify and 
retrieve non-speech audio. (See references.) 

This paper describes an audio engine and an 
end-user application, both already implemented, 
that permit retrieval of sounds based not only on 
traditional methods (keywords, soundfile header 
information, creation date, etc.), but also on the 
sounds' content, i.e., acoustic attributes. The
end-user application is a multiplatform, multiuser
sound-effects browser, called SoundFisher. (This 
paper does not mention the many features of the 
engine that are unused by SoundFisher. See Wold 
et al. 1999 for more information.) 



2. Audio feature analysis and comparison 

This section summarizes our technique for 
analyzing audio signals in a way that facilitates 
audio classification and search. For each frame of 
audio data (25 ms long, with a hop size of 10 ms) 
we measure a number of acoustic features of each 
sound. The analysis produces, over the course of 
the entire sound, a time series where each element 
is a vector of floating-point numbers representing 
the instantaneous values of the features. This sort 
of analysis works best when the sound is 
homogeneous in character, e.g., a door slam or 
rain. When analyzing a longer heterogeneous 
recording, e.g., a news broadcast, one can 
automatically segment the recording and compute a 
feature vector for each segment. 
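
The framing step above can be sketched as follows. This is a minimal illustration in plain Python, not the engine's implementation: the function name is hypothetical, and dropping a trailing partial frame is an assumption the paper does not state.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a sequence of samples into overlapping frames (25 ms long, 10 ms apart)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each span at least one sample")
    frames = []
    start = 0
    # Keep complete frames only; a trailing partial frame is dropped (an assumption).
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

At an 8 kHz sample rate, for example, each frame holds 200 samples and successive frames start 80 samples apart, so adjacent frames overlap by 15 ms.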



2.1 Frame-Level Features

The following features are currently extracted 
from each frame: loudness, pitch, brightness, 
bandwidth, and mel-filtered cepstral coefficients 
(MFCCs). The first three features were discussed in 
Keislar et al. (1995). Bandwidth is computed as the 
magnitude-weighted average of the differences
between the spectral components and the centroid. 
A vector of MFCCs is computed by applying a 
mel-spaced set of triangular filters to the STFT and 
following this with the discrete cosine transform. 
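
As an illustration of the bandwidth definition above, the sketch below computes a spectral centroid and then the magnitude-weighted average distance of each spectral component from it. It assumes linear frequency in Hz and absolute differences; the paper's exact scaling may differ, and the function names are hypothetical.

```python
def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency of one frame's spectrum."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, mags)) / total


def spectral_bandwidth(freqs, mags):
    """Magnitude-weighted average of |frequency - centroid| for one frame."""
    total = sum(mags)
    if total == 0:
        return 0.0
    centroid = spectral_centroid(freqs, mags)
    return sum(m * abs(f - centroid) for f, m in zip(freqs, mags)) / total
```

For components at 100, 200, and 300 Hz with magnitudes 1, 2, and 1, the centroid is 200 Hz and the bandwidth is 50 Hz.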

Since the dynamic behavior of a sound is 
important, the low-level analyzer also computes the 
instantaneous derivative (time differences) for all 
the aforementioned features. 
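
The paper does not say how the time differences are computed; the simplest reading is a first-order difference between consecutive frames, sketched here with a hypothetical function name.

```python
def time_differences(series):
    """First-order differences between consecutive frame values of one feature."""
    return [later - earlier for earlier, later in zip(series, series[1:])]
```

A feature series of N frames yields N - 1 difference values.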

2.2 Higher-Level Features

From the time series of frame values, we 
extract higher-level information. We compute the 
mean and standard deviation of the frame-level time 
series for each parameter, including the parameter 
derivatives. When computing the mean and standard 
deviation, the frame-level features are weighted by 
the instantaneous loudness so that the perceptually 
important sections of the sound are emphasized. 
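
The loudness weighting above can be sketched as a weighted mean and standard deviation. This is a minimal illustration that assumes loudness is expressed as non-negative linear weights (not decibels); the function name is hypothetical.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and standard deviation of one feature's frame-level series."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)
```

Frames with near-zero loudness (silence before or after the sound, for instance) then contribute almost nothing to either statistic.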

The user can present the system with a single 
sound for comparison, or with examples of a class 
of sound. In the latter case, we can infer something 
from the variability of the parameters across the 
different recordings. For example, there may be 
several samples of oboe tones, each at a different 
pitch. If one of these is presented as an example of 
an oboe sound to the system, the system has no a 
priori way of determining that it is the timbre of 
the sound that determines the class, rather than the 
particular pitch of this sample. However, if all the 
samples are presented to the system, the variability 
of the pitch can be noted across the samples and 
then used to weight the different parameters in 
comparison. This information can then be used 
when comparing new sounds to this class. This
variability is stored in the standard deviation
portion of the class's feature vector. 

For this single-Gaussian statistical model, the 
distance measure used is essentially the Euclidean 
distance between the two sounds' feature vectors, 
with each dimension scaled by its standard 
deviation. The user is given the ability to apply 
additional weights to the different features, which is 
useful when certain features are known to be more 
(or less) pertinent to the task at hand. 
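
The distance measure described above might look like the following sketch. How the user's weights enter the formula is not specified in the paper, so multiplying each squared term by its weight is an assumption, as are the function name and the small floor that guards against a zero standard deviation.

```python
import math


def scaled_distance(query, class_mean, class_std, user_weights=None, eps=1e-9):
    """Euclidean distance with each dimension scaled by the class's standard deviation."""
    if user_weights is None:
        user_weights = [1.0] * len(query)
    total = 0.0
    for q, m, s, w in zip(query, class_mean, class_std, user_weights):
        z = (q - m) / max(s, eps)  # a feature that varies widely in the class counts less
        total += w * z * z
    return math.sqrt(total)
```

In the oboe example, pitch would have a large standard deviation across the training samples, so a pitch mismatch would add little to the distance, while timbre-related features would dominate.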



3. The SoundFisher Sound-Effects 
Browser 

This section presents the SoundFisher audio 
browser, which runs on the Macintosh, Windows 
(95, 98, and NT), and UNIX (Solaris, SGI) 
platforms. Written in Java, the SoundFisher GUI 
communicates with an included audio-analysis and 
database engine written in C. 

Figure 1 shows the GUI for the application 
after a search has been performed. The row of 
buttons across the top provides functions such as 
"forward" and "back," allowing navigation to 
previously displayed query results. Below the 
buttons is the query area, and the bottom portion of 
the window displays the query results. 

A database is built up by adding URLs to it 
(either local files or Web addresses.) Directories can 
be added recursively; in a single step one can add all 
the sound files on a given disk, for example. The 
supported audio file formats include WAV, AIFF,
AU, and Sound Designer II. When sounds are added, 
the engine analyzes the audio in the file or URL 
and stores the resulting feature vector in the 
database. Long sound files can be automatically 
segmented. In addition, "thumbnails" of sounds can 
optionally be generated. A thumbnail is a low- 
resolution, optionally truncated version of the 
source sound file. Thumbnails are useful for 
auditioning search results when the original sounds 
are offline. 

The data record for each sound includes not 
only the acoustic feature vector but also soundfile 
information (sample rate, format, number of 
channels, duration, etc.), date, and textual keyword 
and comment fields. The text fields can be applied 
recursively when adding a directory. In addition, the 
user can define and add new text fields. Text fields 
can be edited at any time. 

Users can create hierarchical categories of 
sounds. The default categories mimic the filesystem 
organization of the added directories, but the 
category names and hierarchy can be easily edited 
using a familiar graphical paradigm (e.g., Windows
Explorer or the Macintosh Finder's List view).
These categories are arbitrarily defined by the user. 
The user can also create "classes": sets of sounds 
whose acoustical feature vectors are close to sounds 
that the user has provided as a training set. 

Below the top row of buttons is the query area, 
which is reminiscent of the Find File utilities on 
Mac and Windows. Multiple criteria can be 
combined with a Boolean AND operation. A query 
is formed using a combination of constraints on the 
various fields in the database schema as well as 
"query by example" (comparison to a selected 
sound). For example, Figure 1 illustrates a query
based on similarity to a selected sound (the noise of
a crowd), in combination with a constraint based on
a data field (duration). As indicated, the search can
operate over the entire database, or it can apply to
the currently displayed or currently selected records.

The bottom portion of the window displays the 
current records (often, the result of a query). Sounds 
can be auditioned by double-clicking, and multiple 
selections are possible. These results can be viewed 
in one of three ways: as a list (Figure 1), 
hierarchically by category, or as a 2-D plot (Figure 
2). In the 2-D plot, the axes can be various acoustic 
attributes or the begin and end times of the sounds. 
The begin and end times are useful for
automatically segmented sound files; by choosing 
begin time for an axis, one can view the temporal 
trajectory of a particular acoustic feature. 

For more information, see the Muscle Fish
Web site, www.musclefish.com.

Abstract

Despite vast research and development efforts in such diverse domains as digital signal processing, 
psychoacoustics, speech recognition, computer music, and multimedia databases, there is a paucity 
of literature that addresses the issue of automatic classification of sounds (whether real-world 
sounds or synthetic sound effects). In this paper we suggest that much of the necessary 
knowledge exists today but needs to be redirected or refocused to yield the solutions to these 
specific problems. After surveying some related research, this paper presents our work-in- 
progress, developing an analysis engine and a client application that would help automate the 
process of sound classification, retrieval, and selection. 



Anyone who has ever heard a sound and tried to describe it knows of the difficulties. Words are woefully 
inadequate to convey the essence of a sound. It has been said, for example, that writing about music is like 
dancing about architecture. This being the case, people often resort to describing sound by the use of simile ("it 
sounds like a herd of elephants") or through the use of onomatopoeia (a film producer working at the side of a 
sound designer might ask for "a good thwack in that face-hit."). At other times people will describe a sound by 
referring to some emotion that it evokes ("it has a mournful sound"). A musician might describe a sound using 
musical terms like "crescendo from mezzo-piano to fortissimo." Similarly, an acoustician might refer to a sound's 
acoustical attributes ("it has an exponential decay and the energy is concentrated in the upper partials"). And of 
course, sometimes descriptive adjectives are evocative ("a shimmery sound"). When words fail to describe a
sound, the sound might best be described by comparing it to another ("this sound is very similar to that sound," or 
"find me another sound that sounds like this one"). 

But who really cares about finding a sound, whether by describing it verbally or by comparing it to another 
sound? Ask the film sound-effects designer who is struggling to finish off that large-restaurant ambient sound for 
the current scene while the producer runs out of patience. Or ask those computer animators that wish they had a 
system that would automatically find and mix some appropriate sounds to accompany their latest short film. And 
ask the technology-savvy sales representative who has finally been convinced that multimedia can breathe life into
the presentation (due tomorrow) and who is now trying to find a sound with punch to conclude the slide show. 

Such users are greatly hampered by existing sound-effects databases (often referred to as "librarians"). These 
databases typically permit the user to associate a limited number of textual keywords and/or descriptions with each 
sound. A sound can often be placed into a category, but generally cannot appear in more than one category. This 
is a severe limitation, since the classification of any sound will often change depending upon the way that sound 
functions within a specific category (for example, within a given document or composition). 

While words can sometimes serve as sufficient keys for the retrieval of data, there are many cases where one 
would like to query the source more directly. Content-addressable databases or content-based retrieval are the 
terms usually used to describe this kind of information storage and retrieval system. Such systems permit 
searches for features, keys, or triggers extracted directly from the data (as opposed to being one step removed 
from the data, as in a keyword-only description of that data). While this capability has ceased to be a problem for 
text-only databases, the race for a solution to this problem in the area of multimedia document management is just 
starting. As usual, sound has taken a back seat to image (both moving and still), but it will not be ignored for 
long. As more and more sounds become available to creators of multimedia works, there is an increasing need to 
quickly and efficiently locate a particular sound or a set of similar sounds. Giving authors of multimedia 
documents a choice of hundreds or thousands of sounds without the tools to find them is a little like casting 
someone adrift in a rowboat without any oars. 

In this paper, we will introduce some novel approaches to analyzing, cataloguing, and retrieving sounds from an 
audio database. Although the approaches we will present are somewhat new, the techniques of time-varying 
signal analysis and processing on which our approach is based have been in existence (or evolving) for years. 
One of the main goals of this paper is to suggest ways in which sounds may be retrieved from repositories by 
using any one or a combination of objective (acoustic) metrics, by specifying subjective perceptual features, or 
even by selecting or entering a reference sound and asking the database to retrieve all sounds that are similar (or 
dissimilar) to it. The four main sections of this paper review the related literature, explain the signal-analysis 
techniques that we propose for a sound database, present the database schema, and "walk through" the functions 
of our suggested database browser. 



2. Previous Research 
Sound Taxonomies 

During the last four decades, numerous attempts have been made to develop taxonomies of sounds, including 
musical and environmental sounds [Schaeffer 1966, Schafer 1980, Tenney 1961, Vertegaal and Bonis 1994]. 
This interest of musicians and psychologists has shed some light on the analysis and classification of sound, be it 
by objective or subjective measures. 

Timbre Analysis 

Sounds are traditionally described by their pitch, loudness, duration, and timbre. The first three of these 
perceptual attributes are well-understood and fairly easily measured. Timbre, on the other hand, is an ill-defined 
attribute that encompasses all the distinctive qualities of a sound other than its pitch, loudness, and duration. The 
effort to discover the components of timbre underlies much of the previous psychoacoustic research that is 
relevant to content-based audio retrieval [Helmholtz 1885, Risset and Mathews 1969, Plomp 1976, Grey 1977, 
Gordon and Grey 1978, Wessel 1979]. 

Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope. The attack 
portions of a tone are often essential for identifying the timbre. Timbres with similar spectral energy distributions 
(as measured by the centroid of the spectrum) tend to be judged as perceptually similar. However, research has 
shown that the time-varying spectrum of a single musical instrument tone cannot generally be treated as a 
"fingerprint" identifying the instrument, because there is too much variation across the instrument's range of 
pitches, and across its range of dynamic levels. 

Source Separation 

Simultaneous sound sources present a huge obstacle for any sound-analysis environment. Approaches to 
separating simultaneous sounds typically involve either Gestalt psychology [McAdams 1984, Bregman 1993, 
McAdams 1993] or non-perceptual signal-processing techniques [Moorer 1975, Wang 1994]. For musical 
applications, automatically parsing a monophonic melody is feasible, but a completely general-purpose, 
polyphonic pitch-tracking algorithm might well be an intractable problem, recent efforts notwithstanding [Moorer 
1975, Chafe et al 1985, Depalle et al 1993]. 

Sound Librarians and Editors 



As mentioned earlier, most existing sound databases and sound librarians suffer from a retrieval paradigm that is 
limited to keywords or mutually exclusive categories, with no possibility of content-based retrieval. Also, data 
entry in existing systems is intensive: there is no automatic analysis or classification of sounds. 

Various researchers have discussed or prototyped graphical sound editors capable of extracting musical structure 
from a sound [Buxton et al 1981, Chafe et al 1982, Foster et al 1982]. The goal was to allow queries such as 
"find the first occurrence of the note G-sharp." Unfortunately, most of this research came to a halt after the 
introduction of MIDI (Musical Instrument Digital Interface), which eliminates the need for this functionality in the 
case where the sound is generated by a MIDI instrument. 

The Intuitive Sound Editing Environment (ISEE) is a recent software package for controlling MIDI synthesizers 
[Vertegaal and Bonis 1994]. Its user interface is based on the notion of timbre space [Wessel 1979], that is, 
continuous control of a small number of orthogonal, device-independent timbral parameters. The parameters are 
"overtones" (harmonicity), "brightness" (spectral energy distribution), "articulation" (control of spectral transients 
and persistent noise), and "envelope" (the speed of the amplitude envelope). These parameters could also be 
applied to the analysis of digital audio, an approach that would bear some similarity to the analysis techniques 
presented in this paper. 



3. An Analysis Engine for Content-Based Retrieval of Audio 

In this section of the paper, we present a general paradigm and specific techniques for analyzing audio signals in a 
way that facilitates content-based retrieval. (Section 5 of the paper describes a client application with a graphical 
user interface for retrieving the audio data.) 

3.1 Introduction To The Analysis Technique 

By content-based retrieval of audio, one can mean a variety of things. At the simplest level of implementation — 
but the least simple level of usage— one could retrieve a sound by specifying the exact numbers in an excerpt of 
the sound's sampled data. At the next higher level of abstraction, the retrieval would match any sound containing 
the given excerpt, regardless of the data's sample rate, quantization, compression, etc. At the next level, the query 
might involve frequency-domain information or other acoustic attributes that can be directly measured. Finally, 
at the most difficult level of implementation but potentially the most user-friendly level, the query could include 
perceptual (subjective) properties of the sound. 

It is this final level — perceptual properties — with which we are most concerned. (The implementation of the first 
two levels is conceptually straightforward and need not be discussed here.) Some of the aural (perceptual) 
properties of a sound, such as pitch, loudness, and brightness, correspond closely to measurable attributes of the 
audio signal, making it logical to provide fields for these properties in the audio database record. However, other 
aural properties (for instance, "scratchiness") are more indirectly related to easily measured acoustical attributes of 
the sound. Some of these properties may even have different meanings for different users. (The phenomenon of 
synaesthesia is an extreme case of subjectivity: a user might call certain sounds "blue" and others "red.") To 
support subjective properties, the database record format should be user-extensible. 

To be able to use different perceptual criteria to retrieve a sound, we first measure a variety of acoustical attributes 
of each sound. This set of N attributes is represented as an N-vector. In text databases, the resolution of queries 
typically requires matching and comparing strings. In an audio database, we would like to match and compare the 
sort of aural properties described above (such as "scratchiness"). For example, we would like to ask for all the 
sounds similar to a given sound or that have more or less of a given property. To guarantee that this is possible, 
the space of N-vectors should satisfy the following constraints for each aural property to be used in retrieval: 

1. Sounds which differ in the aural property should map to different regions of the N-space. If this were not 
satisfied, the database could not distinguish between sounds with different values for this property. 

2. If the user ranks a set of sounds in increasing amounts of the given aural property, these sounds should 
map approximately to a smooth path in some M-dimensional projection of the N-space, where M ≤ N. Since we use 
a linear model, we have the additional restriction that the smooth path should be approximately linear. Note: In 
the case where the aural property is binary (that is, where a sound either has the property or it doesn't), the linear 
constraint is not so important. 

Since we cannot know the complete list of aural properties that users may wish to specify, it is impossible to 
guarantee that our choice of acoustical attributes will meet these constraints. However, we can make sure that we 
can meet these constraints for many useful aural properties. 



3.2 Acoustical Attributes 

The following aspects of sound are analyzed: 

• Pitch. Pitch is estimated by taking a series of short-time Fourier spectra. For each of these frames, the 
frequencies and amplitudes of the peaks are measured and an approximate weighted greatest common divisor 
algorithm is used to calculate the pitch (expressed as log frequency). The pitch algorithm also returns a pitch 
confidence value which can be used as a measure of "how pitched" the sound is. 

• Harmonicity. This parameter distinguishes between harmonic spectra (e.g., vowels and most musical 
sounds), inharmonic spectra (e.g., metallic sounds), and noise (spectra that vary randomly in frequency and 
time). 

• Loudness. Loudness is approximated by the signal's RMS level in decibels, which is calculated by taking a 
series of windowed frames of the sound and computing the square root of the mean of the squares of the 
windowed sample values. (This method does not account for the frequency response of the human ear; if desired, 
the necessary equalization can be added by applying the Fletcher-Munson equal-loudness contours.) 

• Brightness. Brightness is computed as the centroid of the short-time Fourier spectra. 

• Formants. Formants are computed by smoothing the short-time Fourier spectra and looking for broad 
peaks. (We use the word "formants" loosely to refer to broad spectral peaks, which may or may not be constant as 
the fundamental frequency varies.) The formants are parameterized by logarithmic values of frequency, 
magnitude, and width. 
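
As a rough illustration of two of the attributes above, the following sketch computes a loudness track (RMS level in decibels) and a brightness track (spectral centroid) frame by frame. It is not the analysis engine described in this paper: the function names, the frame and hop sizes, the use of unwindowed frames for the RMS level, and the naive DFT are all assumptions made to keep the example short and dependency-free.

```python
import cmath
import math

def rms_db(frame):
    """RMS level of one frame in dB relative to full scale (amplitude 1.0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame, using a naive DFT."""
    n = len(frame)
    # Hann window (an assumed choice) to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    weighted_sum = magnitude_sum = 0.0
    for k in range(n // 2 + 1):
        bin_value = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                        for i in range(n))
        magnitude = abs(bin_value)
        weighted_sum += magnitude * k * sample_rate / n
        magnitude_sum += magnitude
    return weighted_sum / magnitude_sum if magnitude_sum > 0 else 0.0

def frames(signal, size, hop):
    """Split a signal into overlapping frames of equal length."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

# Hypothetical test signal: a 1 kHz sine at half amplitude, sampled at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(512)]
loudness_track = [rms_db(f) for f in frames(tone, 128, 64)]
brightness_track = [spectral_centroid(f, sample_rate) for f in frames(tone, 128, 64)]
```

For a steady sine wave both tracks are flat, and the centroid sits at the frequency of the tone. A real implementation would use an FFT and could add the equal-loudness weighting mentioned above.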

All the above aspects of sound can vary over time. The trajectory in time is computed during analysis but not 
stored as such in the database. However, for each of these trajectories, several parameters are computed and 
stored, including: 

• Average. 

• Maximum and minimum. 

• Variance. 

• Autocorrelation. This is a measure of the smoothness of the trajectory. This can distinguish between a pitch 
glissando and a wildly varying pitch (for example), which a simple variance measure cannot. 

• Parameters relating to the shape of the smoothed trajectory: critical points, number of inflections, attack and 
decay time (of loudness trajectory). 

In addition, the duration of the sound is stored. The N-vector of measured attributes thus consists of duration plus 
the parameters just mentioned (average, maximum, minimum, autocorrelation, shape parameters) for each of the 
aspects of sound given above (pitch, harmonicity, loudness, brightness, and formants). If some of these 
parameters are found to be less useful for certain of the aspects of sound, those attributes can be omitted for 
efficiency. 
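
The trajectory statistics listed above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the lag-1 autocorrelation coefficient is used as one plausible reading of "a measure of the smoothness of the trajectory."

```python
import statistics

def lag1_autocorrelation(values):
    """Lag-1 autocorrelation coefficient: near +1 for smooth trajectories."""
    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    denominator = sum(c * c for c in centered)
    if denominator == 0:
        return 1.0  # constant trajectory: treated here as perfectly smooth
    return sum(a * b for a, b in zip(centered, centered[1:])) / denominator

def summarize(trajectory):
    """Reduce one time-varying attribute to a few stored parameters."""
    return {
        "average": statistics.fmean(trajectory),
        "maximum": max(trajectory),
        "minimum": min(trajectory),
        "variance": statistics.pvariance(trajectory),
        "autocorrelation": lag1_autocorrelation(trajectory),
    }

# Two hypothetical pitch trajectories containing exactly the same values:
# a smooth glissando, and the same values visited in a wildly alternating order.
glissando = [float(i) for i in range(20)]
erratic = [float(v) for pair in zip(range(10), range(19, 9, -1)) for v in pair]

smooth_stats = summarize(glissando)
erratic_stats = summarize(erratic)
```

Both trajectories have identical average, extremes, and variance, but the autocorrelation is strongly positive for the glissando and negative for the erratic sequence, which is the distinction the text attributes to this parameter.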



3.3 Training The System and Retrieving Data 
Continuous Properties 

As we mentioned above, some aural properties will directly relate to the measured attributes above. However, we 
need a method to teach the system about new aural properties, especially those that are subjective and can vary 
between different users. 

To train the system, the user picks a set of sounds which show varying amounts of the property in question. The 
user sorts the sounds according to their perceived ranking, assigns an approximate numerical value of the property 
for each sound, and submits them to the system. When training by example, the more examples the user has, the 
better the system's understanding will be. It would also be best if the user submitted examples which covered a 
wide range of values for the property. 

For each sound s[j], j = 0 to M-1, the system computes the N-vector a[j], if it is not already computed. (M is the 
number of sounds to be analyzed, and N the number of acoustical parameters.) To find the relationship between 
the aural property p[j] of each sound and the measured attributes a[j], we use a standard linear regression model 
with parameter vector b. That is: 

$$p[j] = b^{T} a[j] + e[j]$$

where e is the error in the model. (The superscript T indicates a transposed matrix.) 

Given p[j] and a[j], the b parameters can be estimated using least squares, which is the unbiased, minimum 
variance estimate for b. Note that the elements of b can be reported to the user to indicate which elements of a are 
most significant to the aural property under consideration. 

The algorithm computes the variance of e, which can be examined to give the user an indication of how well the 
model fit the data. At some threshold, the system could indicate that the training failed — that is, the user needs to 
supply more training sounds, or the currently measured attributes do not meet the constraints (listed in Section 
3.1, above) for the specified aural property. In the latter case, the property might be an uncomputable property 
(e.g., "how much I like the sound"), or there might be further measurements which could be added to the 
repertoire to increase the usefulness of the database. 
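
The training step can be illustrated with a small least-squares fit. The sketch below solves the normal equations directly with Gauss-Jordan elimination; this solver, the function names, the training numbers, and the constant 1.0 appended to each attribute vector as an intercept term are all assumptions made for the example, not details given in this paper.

```python
def solve(matrix, rhs):
    """Solve a small square linear system (Gauss-Jordan, partial pivoting)."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: more varied training sounds needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_property(attributes, ratings):
    """Estimate b in p[j] = b^T a[j] + e[j]; also return the variance of e."""
    n = len(attributes[0])
    # Normal equations: (A^T A) b = A^T p
    ata = [[sum(a[i] * a[k] for a in attributes) for k in range(n)]
           for i in range(n)]
    atp = [sum(a[i] * p for a, p in zip(attributes, ratings)) for i in range(n)]
    b = solve(ata, atp)
    residuals = [p - predict(b, a) for a, p in zip(attributes, ratings)]
    error_variance = sum(e * e for e in residuals) / len(residuals)
    return b, error_variance

def predict(b, a):
    """Apply the learned mapping to one attribute vector."""
    return sum(bi * ai for bi, ai in zip(b, a))

# Hypothetical training set: [brightness, loudness, 1.0] for five sounds, with
# user ratings that happen to follow 0.5*brightness - 0.1*loudness + 0.2.
training = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [3.0, 5.0, 1.0],
            [4.0, 3.0, 1.0], [5.0, 4.0, 1.0]]
ratings = [0.5 * a[0] - 0.1 * a[1] + 0.2 for a in training]

b, error_variance = fit_property(training, ratings)
new_value = predict(b, [2.5, 2.0, 1.0])  # value assigned to an unrated sound
```

A large `error_variance` would be the signal, described above, that training failed. The magnitudes of the entries of `b` indicate which attributes matter most, provided the attributes are on comparable scales.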

Once the mapping between measured attributes and the aural property is understood by the system, the mapping is 
applied to each sound in the system (or each of the sounds currently of interest to the user). The name and value 
of the property is included in the database record for that sound. Once this is accomplished, it is straightforward 
to select sounds from the database with queries relating to the property. For example: 




• Query by value: Retrieve all the sounds with values of property p0 greater than 0.9 and property p1 less than [...] 

• Query by example: Retrieve all the sounds similar to this sound with respect to property p0. "Similar to" 
means "within some delta of." Retrieve all the sounds with less p1 than this sound. 

• Organization/Browsing: Sort the current sounds by property p0. Group the current sounds by properties p0 
and p1. 
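
Once property values are stored in each record, the three kinds of query above reduce to simple filters and sorts. The sketch below assumes an in-memory dictionary of records; the property names and values are invented for illustration, and a real system would presumably issue such queries against the database itself.

```python
# Hypothetical records: sound name -> stored aural property values.
sounds = {
    "fire crackling": {"scratchiness": 0.95, "brightness": 0.60},
    "faraway thunder": {"scratchiness": 0.30, "brightness": 0.10},
    "tam-tam sustained": {"scratchiness": 0.92, "brightness": 0.80},
    "low-filter sweep": {"scratchiness": 0.10, "brightness": 0.35},
}

def query_by_value(db, prop, low=float("-inf"), high=float("inf")):
    """Names of sounds whose property value lies strictly between low and high."""
    return sorted(name for name, props in db.items() if low < props[prop] < high)

def query_by_example(db, prop, reference, delta):
    """Names of sounds within delta of the reference sound for one property."""
    target = db[reference][prop]
    return sorted(name for name, props in db.items()
                  if name != reference and abs(props[prop] - target) <= delta)

def sort_by(db, prop):
    """All sound names ordered by increasing property value."""
    return sorted(db, key=lambda name: db[name][prop])

scratchy = query_by_value(sounds, "scratchiness", low=0.9)
like_fire = query_by_example(sounds, "scratchiness", "fire crackling", delta=0.05)
by_brightness = sort_by(sounds, "brightness")
```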

Binary or Discrete Properties 

In these cases, the property is used as a classifier. In the binary case, a sound either has this property or does not, 
and in the more general discrete case, the sound falls into one of several categories. To train the system, the user 
selects examples of sounds which have the property or do not (in the binary case) or which illustrate the different 
categories (in the general discrete case). 

For each sound, the a vectors are computed if they have not already been computed. The mean vector μ and the 
covariance matrix R for the a vectors in each category are then calculated. The mean and covariance are given by: 

$$\mu = \frac{1}{M} \sum_{j} a[j]$$
$$R = \frac{1}{M} \sum_{j} (a[j] - \mu)(a[j] - \mu)^{T}$$

where M is the number of sounds in the summation. 

When a new sound needs to be classified, a likelihood value is calculated using the multivariate normal 
distribution: 

$$\exp\left(-\tfrac{1}{2}\,(a - \mu)^{T} R^{-1} (a - \mu)\right)$$

To speed up the process, a simpler likelihood value could be used— namely, just the argument of the above 
exponential, with appropriate normalization. 

If the likelihood for the category is above a threshold, the sound is determined to be in that category. If there are 
several mutually exclusive categories, the sound is placed in the category with the highest likelihood. 
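
The discrete case can be sketched in the same spirit. The code below estimates a mean and covariance per category and scores a new sound with the simplified likelihood (the argument of the exponential). Several details are assumptions rather than part of the method as described: the small `ridge` term added to the diagonal so that the covariance stays invertible with few examples, the omission of any normalization term, the category names, and the two-dimensional attribute vectors.

```python
def solve(matrix, rhs):
    """Solve a small square linear system (Gauss-Jordan, partial pivoting)."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def train_category(vectors, ridge=1e-6):
    """Return (mean vector, covariance matrix) for one category's examples."""
    m, n = len(vectors), len(vectors[0])
    mu = [sum(v[i] for v in vectors) / m for i in range(n)]
    cov = [[sum((v[i] - mu[i]) * (v[k] - mu[k]) for v in vectors) / m
            for k in range(n)] for i in range(n)]
    for i in range(n):
        cov[i][i] += ridge  # assumed regularization, not from the paper
    return mu, cov

def score(a, mu, cov):
    """Simplified likelihood: -1/2 (a - mu)^T R^-1 (a - mu)."""
    d = [x - m for x, m in zip(a, mu)]
    y = solve(cov, d)  # y = R^-1 d, without forming the inverse explicitly
    return -0.5 * sum(di * yi for di, yi in zip(d, y))

def classify(a, models):
    """Pick the category whose model gives the highest score."""
    return max(models, key=lambda name: score(a, *models[name]))

# Hypothetical training examples: [pitch confidence, harmonicity].
models = {
    "pitched": train_category([[0.90, 0.80], [0.80, 0.90],
                               [0.95, 0.85], [0.85, 0.95]]),
    "noisy": train_category([[0.10, 0.20], [0.20, 0.10],
                             [0.15, 0.30], [0.30, 0.15]]),
}
first = classify([0.88, 0.90], models)
second = classify([0.20, 0.20], models)
```

For a binary property, the same `score` would instead be compared against a threshold.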



3.4 Miscellaneous issues in sound analysis 
Segmentation 

The above discussion handles the case where each sound is a single gestalt. Longer recordings need to be 
segmented before using the retrieval features above. Segmentation is accomplished by applying the acoustic 
analyses above to the signal and looking for transitions (sudden changes in the measured attributes). The 
transitions define segments of the signal, which can then be treated like individual sounds. 

For example, if the user of the system trained it using applause sounds, a recording of a concert or other 
performance could then be automatically scanned for applause sounds to determine boundaries. Similarly, after 
training the system to recognize a certain speaker (primarily by the formant structure), a recording could be 
segmented and scanned for all the sections where that speaker was talking. 
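
A minimal sketch of transition-based segmentation, assuming a single attribute trajectory (here a loudness track in decibels) and a fixed jump threshold; both the threshold value and the restriction to one attribute are simplifications introduced for this example.

```python
def segment(trajectory, threshold):
    """Split a trajectory wherever consecutive frames differ by more than threshold.

    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    boundaries = [0]
    for i in range(1, len(trajectory)):
        if abs(trajectory[i] - trajectory[i - 1]) > threshold:
            boundaries.append(i)
    boundaries.append(len(trajectory))
    return list(zip(boundaries, boundaries[1:]))

# Hypothetical loudness track (dB): quiet, then a loud event, then quiet again.
loudness = [-40, -41, -40, -12, -11, -12, -13, -39, -40]
segments = segment(loudness, threshold=10)
```

Each resulting segment could then be analyzed and stored like an individual sound.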

Sonification 

Non-audio data can be converted into audio data using a mapping between arbitrary data parameters and audio 
parameters. This is known as sonification (the sound analogue of visualization). Once data is put into a 
meaningful audio form, it can be treated exactly like audio data, and one can make use of the audio retrieval 
methods above. For example, a scientist or a medical technologist might find it easier to identify the "sound" of a 
certain sonified function than to describe its numerical or visual characteristics. 
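
One very simple sonification mapping, offered only as an illustration, turns each data value into a short tone whose pitch rises with the value. The frequency range, note length, and sample rate below are arbitrary assumptions; the text does not prescribe any particular mapping.

```python
import math

def value_to_frequency(value, low, high, f_min=220.0, f_max=880.0):
    """Map a value in [low, high] to a frequency on a logarithmic (pitch) scale."""
    t = (value - low) / (high - low) if high > low else 0.0
    return f_min * (f_max / f_min) ** t

def sonify(data, sample_rate=8000, note_seconds=0.1):
    """Render each data point as a short sine tone; returns a list of samples."""
    low, high = min(data), max(data)
    samples = []
    phase = 0.0  # carried across notes so the waveform stays continuous
    for value in data:
        frequency = value_to_frequency(value, low, high)
        for _ in range(int(sample_rate * note_seconds)):
            samples.append(0.5 * math.sin(phase))
            phase += 2 * math.pi * frequency / sample_rate
    return samples

# Hypothetical data series rendered as audio samples.
audio = sonify([1.0, 2.0, 3.0])
```

The resulting samples could be passed to the same analysis routines as any recorded sound.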

Neural Nets 



Neural nets are another possible way to find the mapping between acoustical attributes a and perceptual 
properties p. The advantage of neural nets is that they can discern non-linear mappings. The disadvantage is that 
it is difficult to see what is going on "inside" the net. Thus, one cannot estimate how good a model the system has 
discovered (how well it understands), and one cannot see which measured attributes correlate most strongly with 
the aural property. 

User control of analysis 

In some cases, the user might know which measured attributes are relevant to the aural property in question. The 
analysis of an aural property can be made faster and more reliable in these cases by allowing the user to constrain 
the analysis to a subset of all the measured attributes. 

One can also imagine making the analysis engine extensible, by allowing a user to "plug in" new analysis 
algorithms. This feature would complicate the software design and implementation. 



4. The Database Schema 

This section describes a data record structure suitable for use with the analysis engine discussed above. 

Each entry in the database points to and describes a particular sound. Fields in each entry include attributes such as 
the sound's name, the list of categories and subcategories that the sound is a member of, the list of features (as 
computed by system analysis routines), a keyword list (user-defined), and links to related files and documents 
which exist outside the database proper. The database schema is summarized in Figure 1. 

The "name" field contains a string that identifies the sound. If no name is supplied, the name of the sound file is 
used. 

"Categories" are arbitrary optional classifiers whose existence is unknown to the audio analysis engine. They can 
be used to identify the project, that is, the group of users or the works (such as compositions or soundtracks) that 
make use of the sound. Or they can represent other user-specified classifications, such as those of 
musical instruments. A sound may be a member of many (or no) categories. For example, a 
sound might be contained in the categories "marketing," "development," "woodwind," and "jazz." Categories 
can recursively contain subcategories; a character such as "/" serves as a delimiter for subcategory names. A 
category might refer to a team of users within a department or to some subdivision of the work, such as one 
scene in a film or one track of a soundtrack. (Recursive hierarchies could conceivably be useful for aural 
properties as well, but for simplicity we allow only categories to be specified hierarchically.) 

"Keywords" are optional user-supplied words, usually descriptive adjectives or nouns, that help tag a sound for 
queries. The only essential difference from categories is that keywords cannot be hierarchical. 

The "features" field of the database schema contains the results of the acoustical analysis. In 
other words, this field stores the "measured attributes" vector (discussed in Section 3.2), along with 
the aural properties known to the system. 
Later, we will describe how the feature list is used by the browser, and how users can define new aural 
properties. 

The aural properties field just contains the names of the aural properties. Each aural property is fully specified by a 
separate definition, which includes the name of the property, the name of the user who defined the 
property (by training the system to understand it), the mathematical mapping of the measured attribute vector to the 
aural property, the date the property was last defined or modified, and links to the sound files (and/or the sound 
names) that were referenced in defining the property. 

In many applications, sound files can have heterogeneous data formats, such as different sample rates, bits per 
sample, or compression schemes. Most sound file formats include a header that fully specifies the data format. 
However, this information could also be stored in the data record. Although such information can be of interest to 
users, we have omitted it from the illustrations for simplicity. 

Also, the database schema presented here doesn't distinguish between the data of different users. A full-blown 
multi-user system might have fields specifying the user under the aural properties, comments, categories and 
keywords fields. This would allow two different users to define two distinct subjective aural properties that 
happened to have the same name (such as "buzzy"). Users could decide to hide their data from others or share it. 



name: 
    defaults to sound file name 
dates: 
    created: 
    analyzed: 
    accessed: 
features: 
    analyzed by the audio analysis engine 
    measured attributes: 
        objective attributes of the sound 
    aural properties: 
        subjective perceptual features 
keywords: 
    just text, user-supplied, non-hierarchical, optional 
categories: 
    user-supplied, optional text, hierarchical OK (delimiter defaults to "/") 
comments: 
    system-supplied text 
    user-supplied text 
    voice annotations 
    graphic annotations 
history: 
    sound file revision history 
links: 
    sound file 
    music score file 
    synthesizer "patch" 
    source data file for sonification 

Figure 1. Database schema 
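
To make the schema of Figure 1 concrete, it could be modelled roughly as the record type below. This is a sketch under assumptions: the class name, field types, and the `in_category` helper are hypothetical, and an actual database would define the fields in its own schema language.

```python
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional

@dataclass
class SoundRecord:
    """One database entry, following the fields listed in Figure 1."""
    sound_file: str                                   # link to the sound file
    name: str = ""                                    # defaults to sound file name
    created: Optional[datetime] = None
    analyzed: Optional[datetime] = None
    accessed: Optional[datetime] = None
    measured_attributes: list = field(default_factory=list)   # objective N-vector
    aural_properties: dict = field(default_factory=dict)      # name -> value
    keywords: set = field(default_factory=set)        # flat, user-supplied
    categories: set = field(default_factory=set)      # hierarchical paths
    comments: list = field(default_factory=list)
    history: list = field(default_factory=list)       # sound file revisions
    links: dict = field(default_factory=dict)         # score, patch, source data

    def __post_init__(self):
        if not self.name:
            self.name = PurePosixPath(self.sound_file).name

    def in_category(self, prefix, delimiter="/"):
        """True if the sound is in the category or any of its subcategories."""
        return any(c == prefix or c.startswith(prefix + delimiter)
                   for c in self.categories)

# Hypothetical entry; the file path is invented for the example.
record = SoundRecord(
    sound_file="sounds/fire crackling.aiff",
    keywords={"ambient", "nature sound"},
    categories={"Haiku/Line2"},
)
```

Storing categories as delimiter-separated paths keeps the hierarchy implicit, so a query for "Haiku" also finds sounds filed under "Haiku/Line2".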

5. A Tour Of The Database Browser's Functionality 

In this section we present the functionality and user interface of an application that retrieves audio from a database 
using the analysis engine and database schema described above. This front-end database application lets the user 
browse through sounds or search for them using queries. In addition, it permits general maintenance of the 
database's sound entries: adding, deleting, analyzing, and describing sounds. It also lets the user define new 
"aural properties." The browser described here is merely one of a number of possible applications that could "sit 
on top of" the database and audio analysis engine. 

A series of mock-up diagrams (Figs. 2-6) will serve as a "tour guide" of the application's functionality. These 
diagrams depict the application as having a menu-based user interface, but in practice such an interface might not 
always be the most suitable. For example, a sound designer or editor who is accustomed to audio equipment 
might find a menu-based interface less intuitive than one with physical knobs, buttons, sliders, and perhaps even a 
piano-like keyboard. For such a user, having to switch interface modes and devices can be highly disruptive. (A 
separate paper covering these user-interface issues is planned.) Note, however, that even if the browser used a 
menu-based interface, it might look nothing like the one illustrated here. The purpose of this section is to describe 
the core functionality, not to specify the user interface. 

5.1 Outer Window 

Figure 2 shows access to all of the browser's functionality: it depicts the application with all menus engaged. 
Menus on the outer (top-level) window can be used to search for sounds and define new aural properties. The 
specific procedures supporting these operations are described in subsequent sections. The inner window displays a 
hierarchy of sorted lists, arranged as successive columns from left to right. 

The top-level window's Query menu enables the user to search for sounds, querying by value or by example. 
The Aural Properties menu lets the user define new aural properties or modify existing ones. The Analysis menu 
allows the user to view and control the execution of analyses. The Windows menu is discussed below. Standard 
cut/copy/paste functionality is provided by the Edit menu. 



[Figure 2: mock-up of the browser window, not reproduced. Legible menu commands: query by value, query by 
example; new, open, save, save as, close; analysis scheduling (auto, on reference, when idle, at time); 
windows (new, tile, cascade, add column); copy, paste, undo. Each list column (Category List; Keyword List, 
constrained by selected category; Sound List, constrained by selected keyword) offers the commands add, info, 
rename, copy, move, delete, and sort. The rightmost column shows the fields of the selected sound, as listed in 
Figure 1. The user can play selected sounds (individual sounds are also playable via double-click in the sound 
list) and can pop up related applications (Edit/Mix).] 



Fig. 2 - An overview of the sound browser. Fully engaged menus show the basic functionality of the browser. 
Outer (top-level) window can contain multiple views of the database (inner window). Each view consists of 
multiple scrollable columns that list data field values such as categories and sounds. 

5.2 Inner (list) Windows 

The displayed ordering of the columns [...] (Figure 2.) By default, [...] the user can choose what sorts of data 
fields to [...] columns. 

For example, [...] the leftmost column [...] updates to the middle list; subsequently selecting an item in the 
middle list [...] used on columns. 

Because categories can be hierarchical, the Category column [...] a folder's contents in the Macintosh Finder's 
"View by Name" mode. 

Menus are associated with each list [...] selected item(s) in the list. For example, [...] sound file can be 
entered into the [...] annotate information about the objects using these menus. 

The column in Figure 2 labeled Sound Attributes [...] selected sound(s). This window is not [...] the other 
columns; it is always rightmost and conceptually [...] the other columns, the Sound Attributes [...] it from the 
column to its left). [...] the Sound Attributes column contains all [...] Some of the fields are editable. In the 
case of [...] the Next and Previous buttons (not [...] can use the info... command in the Sounds column. 

It is often useful to have several [...] the Windows menu (at the top of the outer window) [...] windows. Figure 3 
shows two multi-list windows. [...] the default list ordering: by category, [...] by sound. The bottom list [...] by 
sound, by keyword, by category. The latter ordering is useful when one wants [...] defined for a particular sound 
and all the categories containing sounds [...] Likewise, the user can specify an ordering that [...] browsing all 
of the categories that are used in any record along with the selected keyword. 

[...] For example, you could drop a multiple selection from the Sound column of one window into the Category 
column of another window, in order to add the sounds to that category. 

5.3 Play and Edit/Mix Buttons 

The button labeled Play at the bottom of Figure 3 enables quick previewing of the selected sound(s). 
To provide the greatest utility when using the browser [...] The button labeled Edit/Mix launches and 
communicates with [...] displays the currently selected sound's waveform and [...] copy and paste portions of it, 
as [...] the same sound file, the [...] or their modification can trigger a new analysis.) 

The editor also functions as a mixer. When Edit/Mix is clicked, the browser sends the editor the currently 
[...] window in [...] mixed together before sending them to the mixing application. 

[...] semi-automatic assembly of mixed sounds [...] more active (generative) role. To refine the mix, the user 
[...] quickly assembling test mixes. 

[Figure 3: two multi-list browser windows, not reproduced. Both windows show the record for the sound "fire 
crackling": added Jan 4 1993 10:55:47; analyzed Jan 4 1993 11:45:03; accessed Mar 3 1994 03:23:23; categories 
Miro's Blues/Bk... and Haiku/Line2; keywords ambient, nature sound. The upper window lists categories and 
subcategories (Haiku, Miro's Blues, Glissandi, Network, Study 1, Study 2, and others) followed by the sounds 
they contain (faraway thunder, fire crackling, low-filter sweep, tam-tam sustained). The lower window lists 
sounds (brakes, cable car tracks, chuckle, clink, door slam, and others), then the keywords of the selected 
sound, then the categories that use them.] 



5.4 Aural Properties Menu 

Figure 4 illustrates the use of the Aural Properties menu. The user is able to define new aural properties by 
choosing new. When setting up a new aural property, the user is prompted to choose between a continuous and a 
discrete property, as shown. (See Section 3.3, above, for the mathematical treatment of these two types of 
property.) In either case, the user enters the name of the new property in the dialog that pops up; in the discrete 
case, the user must also specify how many discrete values the property has. Then the user drags training sounds 
into the lists labeled "Example sounds." These sounds are typically dragged from a list window of the type shown 
in previous figures. 

In the discrete case, the user specifies the value for the sound by dragging it to the correct list. 

In the continuous case, the single list is ordered by value. An initial value is automatically assigned to a new 
sound by dividing the existing range into equal increments and giving the new sound a value corresponding to the 
position where it was dropped. The user can audition sounds by double-clicking them, and can drag sounds to a 
different position to reorder the list, automatically creating new values. For greater control, the user can explicitly 
type in values (or use the value slider to set the selected value). A value that has been explicitly set will not be 
automatically changed when the list is reordered or extended. 
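
The automatic value assignment for a continuous property can be illustrated with a small sketch. The paper gives
no implementation, so this is only one plausible reading: the function name assign_values, the default range
0.0 to 1.0, and the dictionary of explicitly set ("pinned") values are all assumptions made for illustration.

```python
def assign_values(ordered_names, pinned=None, lo=0.0, hi=1.0):
    """Return {name: value} for an example list ordered from lowest to highest.

    Unpinned sounds get values at equal increments across [lo, hi] according
    to their list position. Sounds with an explicitly typed value ("pinned",
    a hypothetical name) keep that value when the list is reordered or extended.
    """
    pinned = pinned or {}
    n = len(ordered_names)
    values = {}
    for position, name in enumerate(ordered_names):
        if name in pinned:
            values[name] = pinned[name]          # explicit value is never overwritten
        elif n == 1:
            values[name] = lo                    # a single sound sits at the bottom of the range
        else:
            values[name] = lo + (hi - lo) * position / (n - 1)
    return values


examples = ["low-filter sweep", "faraway thunder", "fire crackling"]
print(assign_values(examples))
print(assign_values(examples, pinned={"faraway thunder": 0.7}))
```

Dragging a sound to a new position corresponds to calling the function again with the reordered list; only the
unpinned entries change.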

At any point during the assembly of the example sound list, the user can request the system to analyze the aural 
property by clicking the See Analysis button. This brings up another window (not shown) in which the user can 
inspect the existing analysis, if any has been done, or initiate the analysis. When the user saves the property 
(using the save menu command), the system schedules the analysis of all sounds in the database, which are 
assigned values of the property that depend on their measured attributes. 

An aural property that has already been defined can be inspected and modified by choosing open, which brings up 
the same sort of panel that was used to initially define the property. 

The job of defining new aural properties should not be taken lightly in a multi-user environment, since it modifies 
every record in the database. It would probably make sense to restrict access to this menu or to issue warnings 
when it is used. 
5.5 Analysis Menu



The Analysis menu, pictured in Figure 2, contains a command to view the analysis results and another command 
to schedule when analyses should take place. The former command brings up a panel (not illustrated) from which 
the user selects which parameters to view. The latter command is described below. 

The sound analysis procedures are time-consuming. Therefore, various schemes exist for telling the system when 
it should take the time to analyze sounds that have been added to the database. Analysis-scheduling options 
include: 

• Automatic. The application triggers the analysis of a new sound immediately upon its entry into the database.

• On Reference. The application triggers the analysis of a sound the first time it is explicitly referenced, for
example, when a user selects a specific sound and asks the system to find sounds that are similar to it. In queries,
only sounds which have been analyzed by the system are candidates in the search (if that search involves
feature-list comparisons).

• When Idle. The application uses its own idle-time to trigger analyses of sounds in the database. The schema
contains flags which are used to determine which sounds have not yet been analyzed.

• At Time. The application enables the user to schedule dates and times for the database analysis procedures. The
user can specify that these time-consuming procedures run during convenient, low-load hours.
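
The four scheduling options can be summarized as a single decision function. This is a minimal sketch, not the
system's actual code: the Policy enum, the should_analyze function, and its keyword arguments are hypothetical
names introduced here, and the already_analyzed argument stands in for the schema flag mentioned above.

```python
from datetime import datetime
from enum import Enum


class Policy(Enum):
    AUTOMATIC = "automatic"
    ON_REFERENCE = "on reference"
    WHEN_IDLE = "when idle"
    AT_TIME = "at time"


def should_analyze(policy, *, already_analyzed=False, just_added=False,
                   referenced=False, idle=False, now=None, scheduled=None):
    """Decide whether a sound should be analyzed at this moment."""
    if already_analyzed:                 # the schema flag says the work is done
        return False
    if policy is Policy.AUTOMATIC:       # analyze on entry into the database
        return just_added
    if policy is Policy.ON_REFERENCE:    # analyze on first explicit reference
        return referenced
    if policy is Policy.WHEN_IDLE:       # analyze when the application is idle
        return idle
    if policy is Policy.AT_TIME:         # analyze once the scheduled time has arrived
        return now is not None and scheduled is not None and now >= scheduled
    return False


print(should_analyze(Policy.AT_TIME,
                     now=datetime(1994, 3, 3, 3, 0),
                     scheduled=datetime(1994, 3, 3, 2, 0)))
```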



5.6 Query Menu 

Figs. 5 and 6 demonstrate the query operations. The user constructs queries by specifying either search values or
sounds. Figure 5 shows the Query->query by value technique. Here the user has specified a variety of search
criteria including keywords, high-level perceptual features, and low-level acoustic attributes. (Almost any data
field can be used as a criterion.) The criteria within and across the levels of description can be constrained using
boolean operators such as and (&&), or (||), and exclusive-or (^). Arithmetic operators (less-than, greater-than,
etc.) can be used where appropriate. The query by value method looks for static (as opposed to relative)
membership; that is, it looks for sounds whose attribute values match or fall within the ranges of values specified
in the search criteria. The search is confined by the current selection (if any) in the list window. For example, if
one or more categories are currently selected in the Category column, only sounds in those categories are
searched. If nothing has been selected in advance of the query operation, the system will test all sounds in the
database. This feature can save time when constructing queries.

[Figure 5 screenshot. Menu: new -> (query by value, query by example), open..., save, save as..., close.
Query-by-value panel:]

Search criteria:

    keywords:            (intro || transition) && foreground

    aural properties:    metallic GT 0.9 || plucked GT 0.8

    measured attributes: average pitch (GT 2000 Hz || LT 300 Hz) &&
                         duration (GT 5.0 sec && LT 10.0 sec)

Buttons: Play, Edit/Mix


Fig. 5 - Query by value. The user clicked the "7" button to the right of the label "Search criteria:" and selected
"keywords," "aural properties," and "measured attributes" from an ensuing "stay-put" list (not shown) of all known
attributes. Each selected attribute is displayed below with its own "7" button for choosing or clearing values of the
attribute and for choosing Boolean operations on the values. The chosen criteria (keywords, aural properties, and
measured attributes in this example) can themselves be connected with Boolean operations, using the buttons to
the left of the attribute name. The Stop button halts an in-progress search.
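
The criteria of Fig. 5 translate directly into a boolean predicate. The sketch below is illustrative only: the
record layout (keys "keywords", "aural", "measured"), the function names, and the use of sound names for the
current selection are assumptions. The operator joining the three criteria is not legible in the figure, so AND
is assumed here.

```python
def matches_fig5_criteria(sound):
    """Static membership test for the example criteria shown in Fig. 5."""
    keywords = set(sound["keywords"])
    aural = sound["aural"]
    measured = sound["measured"]

    # keywords: (intro || transition) && foreground
    keywords_ok = ("intro" in keywords or "transition" in keywords) and "foreground" in keywords
    # aural properties: metallic GT 0.9 || plucked GT 0.8
    aural_ok = aural.get("metallic", 0.0) > 0.9 or aural.get("plucked", 0.0) > 0.8
    # measured attributes: average pitch (GT 2000 Hz || LT 300 Hz) && duration (GT 5.0 sec && LT 10.0 sec)
    pitch = measured["average_pitch_hz"]
    duration = measured["duration_sec"]
    measured_ok = (pitch > 2000 or pitch < 300) and (5.0 < duration < 10.0)

    # Assumption: the three criteria are joined with AND.
    return keywords_ok and aural_ok and measured_ok


def query_by_value(database, selection=None):
    """Test only the current selection if there is one, otherwise every sound."""
    candidates = database if selection is None else [s for s in database if s["name"] in selection]
    return [s["name"] for s in candidates if matches_fig5_criteria(s)]


database = [
    {"name": "clink", "keywords": ["intro", "foreground"],
     "aural": {"metallic": 0.95}, "measured": {"average_pitch_hz": 2500, "duration_sec": 6.0}},
    {"name": "faraway thunder", "keywords": ["transition", "foreground"],
     "aural": {"metallic": 0.1}, "measured": {"average_pitch_hz": 120, "duration_sec": 8.0}},
]
print(query_by_value(database))
```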
Figure 6 shows the Query->query by example technique. Here the system will test for proximity to a selected
source sound: it will search for sounds whose features (measured attributes and aural properties) have values
similar to those of the user-selected (source) sound. (This proximity rating is based on a metric, not described
here, that scales the values for the various parameters depending on their psychoacoustic characteristics.) One can
think of this query as a relative (as opposed to a static) search. The Query by Example dialog contains settings for
"very similar," "somewhat similar," and "very dissimilar." (It might be desirable to let the user specify arbitrary
amounts of similarity.) The source sound for the search should be selected by the user prior to invoking the
Query->query by example operation. If the user omits this step, the Query by Example dialog will issue a
prompt. The source sound may be any sound in the database, an accessible sound file which is outside the
database, or a sound read in through the machine's sound port. Only sounds that have been analyzed are tested
(compared) during any search operation. If the user has configured the system for automatic analysis (see Figure
2), any unanalyzed sounds within the search path will be processed. The query can be constrained to compare
particular parameters: pitch, loudness, harmonicity, brightness, formants, and/or duration. Combinations of these
constraints are possible. To "zoom" into a parameter for finer control, the user can double-click on the
parameter's box to bring up the components of that parameter's trajectory (average, maximum, minimum,
variance, autocorrelation, and shape parameters).
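
A relative search of this kind can be sketched as a scaled distance computation. The paper does not describe its
proximity metric, so everything specific below is an assumption: a scaled Euclidean distance stands in for the
psychoacoustic metric, and the SCALES factors and BANDS thresholds are hypothetical values chosen only to make
the example run.

```python
import math

# Hypothetical per-feature scale factors standing in for the psychoacoustic scaling.
SCALES = {"pitch": 500.0, "loudness": 10.0, "duration": 2.0}

# Hypothetical distance bands for the three dialog settings.
BANDS = {
    "very similar": (0.0, 0.5),
    "somewhat similar": (0.5, 1.5),
    "very dissimilar": (3.0, math.inf),
}


def distance(a, b, features):
    """Scaled Euclidean distance over the chosen features (an assumed metric)."""
    return math.sqrt(sum(((a[f] - b[f]) / SCALES[f]) ** 2 for f in features))


def query_by_example(source, sounds, setting="very similar", features=None):
    """Return names of analyzed sounds whose distance to the source falls in the band."""
    features = list(features or SCALES)      # default: compare all features
    low, high = BANDS[setting]
    hits = []
    for name, attrs in sounds.items():
        if attrs is None:                    # unanalyzed sounds are never compared
            continue
        if low <= distance(source, attrs, features) < high:
            hits.append(name)
    return hits


source = {"pitch": 440.0, "loudness": 60.0, "duration": 2.0}
sounds = {
    "a": {"pitch": 450.0, "loudness": 61.0, "duration": 2.1},
    "b": {"pitch": 2440.0, "loudness": 60.0, "duration": 2.0},
    "c": None,  # not yet analyzed
}
print(query_by_example(source, sounds))
print(query_by_example(source, sounds, "very dissimilar"))
```

Restricting the features argument corresponds to ticking individual parameter boxes in the dialog rather than "all".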

Results for query by value and query by example are displayed identically. The results are displayed in a window 
that is almost the same as a regular multi-list window. The columns can be re-organized as usual, and items 
within a column can be sorted as usual. One can select multiple entries (sounds) within the results window and 
then invoke the Query->query by example operation again. This can be a useful means of refining a query. The 
Play and Edit/Mix buttons function on the currently selected sound(s), just as they would with a regular multi-list 
window. One can also select sounds in the results window and drag them to another window, such as the Aural 
Properties window or the Category column of a multi-list window. 

The only difference between a Query Results window and a normal multi-list window is that the Query Results 
window includes a button called see query. This pops up the query window that generated the results window (in 
case the query window was obscured or closed), so that the user can modify the query. 

It is easy to imagine browsers for other media (video, stills, animation) communicating search criteria to the audio 
browser. One can extend this concept, slightly, to envision a video clip browser that can automatically translate 
video search criteria into relevant search criteria for the sound browser. Effectively, through inter-application 
communication, the video browser would graze the sound-effects database and construct a mix-list of sounds to 
accompany a mix-list of video clips. This could be very useful for multimedia document authors, especially 
during stages of prototyping. The audio browser could similarly communicate with other application programs 
geared toward other usages (such as medical analyses). 

[Figure 6 screenshot. Menu: new -> (query by value..., query by example...), open..., close, save, save as...
Query-by-example dialog:]

Find sounds that are:
    [x] very similar
    [ ] somewhat similar
    [ ] very dissimilar

with respect to features of:
    [ ] pitch   [ ] loudness   [ ] brightness   [ ] duration
    [ ] harmonicity   [ ] formants   [x] all

Fig. 6 - Query by example. User asks the system to produce a list of sounds that are very similar, somewhat similar,
or very dissimilar to the source sound. The source sound is the single sound currently selected in the list window (not
shown). The proximity (comparisons) can be made with respect to all the sound's measured attributes (the default),
or with respect to duration or to a certain group of attributes (those related to pitch, loudness, or spectrum).

5. Conclusion



This paper has outlined some of the main features of a proposed analysis engine and front-end application for 
content-based retrieval of audio. 

There are many other possible features and approaches for such a system. For example, it might be desirable to 
allow the analysis engine to operate on sounds not stored in the database. In this way, a user could acoustically 
analyze a sound file prior to entering it in the database, in order to determine its suitability. 

As another example, a supplementary sound synthesis feature could assist the user in making queries. When the 
user was unsure what values to use, the synthesis feature could create sound examples using different user- 
specified values. The user could approximate the desired sound by testing different values for each of the 
measured attributes, refining the synthesized example until it bore enough similarity to the desired sort of sound. 
(The synthesis might also include signal processing of stored sounds, so that the user could select an existing 
sound as a starting-point for refinement.) 

As mentioned earlier, this is work in progress. Further implementation and testing of our system will reveal
whether the chosen acoustical attributes are sufficient (or excessive) for usefully analyzing and classifying most
"atomic" sounds (i.e., brief sound effects). Practice with the system will also undoubtedly suggest various
modifications and refinements to the prototype user interface. We believe, however, that the basic approach
presented here is feasible for a wide variety of audio database applications.

The availability of low-cost compression/decompression (codec) hardware for video and audio has given
us the opportunity to provide multi-media applications over data networks. This paper describes a
particular system that we have built to demonstrate this capability. A video jukebox has been built using
off-the-shelf computers connected over Ethernet. For wide area connectivity, SMDS attachment has been
used. Combined video and audio clips can be selectively played on a PC equipped with low cost video and
audio codec cards, using JPEG and ADPCM compression hardware. The data for these clips is [...]
obtainable from a number of network configurations is shown from the results of the experiments we have
conducted. These results show that the addition of real-time transport software on top of the standard
Internet protocols allows us to provide video and audio clips with a system comprising [...] equipment.
The quality of these clips is considered acceptable for many applications. We have quantified the capacity
of the system in terms of the data rates and the number of users, and have identified which parts of the
total system become bottlenecks as each of these parameters is increased. Solutions to overcoming these
bottlenecks are described.

I. Introduction

THE availability of low-cost compression/decompression
(codec) hardware for video and audio has given us the
capability to provide multi-media applications over data
networks. Data networks that can support the reduced bandwidth
of compressed video and audio are in place and readily
available. Powerful desk-top computers that can process this
data are rapidly advancing to the point where they are
commodity items. Together, these technologies present an exciting
opportunity to provide applications that utilize video and audio
to users now.

The acceptance of new applications employing multi-media
is dependent on a balance between price and performance. It
is therefore appropriate to ask just what is achievable with
the technology that is available, and to attempt to identify
limitations in systems built with this technology. Further, we
should quantify how the capabilities of a system increase as
we make advances in each of the limiting areas.

This paper adopts this approach, using the design and
performance of a networked video jukebox that we have built
as the basis of our study. The video jukebox allows a user
to view video clips, with stereo audio, on a PC equipped
with low cost video and audio codec cards, using JPEG and
ADPCM compression [...] a panel of the type used for remote
control of a domestic video recorder. The user can view a
list of the clips available and control the display of a clip
through the selection of the appropriate buttons on this panel.
All of the data for the clips is stored on the hard disk of a
remote server with which the client communicates over an
intervening network. The server is a Unix workstation.
Applications for this specific type of system include training,
education and [...]

The prototype system uses Ethernet to connect the client
and server over the local area. Experiments have also been
conducted with the client and server on separate LANs
interconnected via an SMDS link [1]. The operation of the video
jukebox over networks with modest bandwidth is enabled by
the use of compressed video and audio. The PC contains two
cards that provide compression and decompression of video
and audio in real time. Thus the PC is used to record clips,
which are then transferred to the server for storage. As the
jukebox client, the same PC decompresses the clip during
play back.

The motivation for the work reported here is to examine the
requirements and possible solutions for networked multi-media
applications from a systems perspective. The video jukebox is
an example of a specific class of application that focuses this
study. Previous articles in the literature have looked at similar
systems from either a qualitative [2] or theoretical [3] point of
view. This paper complements that work with a study based
on measurements from a working system.

The components of the system can be categorized into the
following areas.

• Computers for the client and server.

• Video and audio subsystems.

• Network connecting client and server.

• Network services.

• Application software.

In building and testing the video jukebox, our aim is to
isolate the particular problems presented by each of these
components, and to identify solutions. [...]
We have quantified the capacity of the system in terms of
the data rates and the number of users, and have identified
which parts of the total system become bottlenecks as each of
these parameters is increased. Solutions to overcoming these
bottlenecks are described.

We have shown that a working system can be built using 
off-the-shelf parts, together with some software for end-to-end 
protocols. This system provides video and audio with a quality 
considered acceptable for many applications. Specifically, we
have made the following measurements. 

• A standard 386 PC client, equipped with video and audio 
hardware, connected to a standard HP series 400 Unix 
workstation configured as a video and audio file server, 
is sufficient to support a compressed video stream of 
256 x 160 pixels at 20 frames/sec in full color alongside 
a stereo audio channel of 4 bit ADPCM. The bit rate of 
such a stream is ~ 0.8 MBit/s. 

• Up to 8 such streams could be supported on a single 802.3 
network dedicated to this application, without noticeable 
degradation of the image or audio quality, and in addition 
a single stream can be supported over an SMDS WAN 
using the current prototype HP 802.3-SMDS router. 

• The number of clients that can be served simultaneously
by a server is dependent on disk performance and the
distribution of data blocks over the disk. In the worst case,
without imposing any constraints on this distribution or
on the scheduling policy in the file system, the HP 400
workstation used in our experiment could support about
8 clients, each reading a separate video and audio clip.
At the time of writing, the latest disk technology could
be expected to double this number.

Our general observations on how to extend networked
multimedia systems to handle higher data rates, more clients, and
other classes of applications, are described in the concluding
section of this paper.
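
The stream figures quoted in the measurements above can be cross-checked with a little arithmetic. The sketch
below is not from the paper. It uses the 9.45 kHz audio sample rate that the paper reports using in its
experiments, and it assumes 24 bits per pixel for "full color" and a 10 Mbit/s 802.3 network; the variable
names are illustrative.

```python
WIDTH, HEIGHT, FPS = 256, 160, 20      # video format quoted in the measurements
STREAM_BPS = 0.8e6                     # quoted combined bit rate (approximate)
AUDIO_RATE_HZ = 9450                   # lowest ADPCM sample rate, used in the experiments
BITS_PER_PIXEL = 24                    # assumption: "full color" means 24-bit RGB
ETHERNET_BPS = 10e6                    # assumption: 10 Mbit/s 802.3 network

audio_bps = 2 * 4 * AUDIO_RATE_HZ                 # two channels of 4-bit ADPCM
video_bps = STREAM_BPS - audio_bps                # what remains for JPEG video
bits_per_frame = video_bps / FPS                  # average compressed frame size
raw_bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
compression_ratio = raw_bits_per_frame / bits_per_frame
ethernet_load = 8 * STREAM_BPS / ETHERNET_BPS     # eight simultaneous streams

print(f"audio: {audio_bps / 1e3:.1f} kbit/s")
print(f"video: {video_bps / 1e3:.1f} kbit/s, {bits_per_frame / 8 / 1024:.1f} KByte per frame")
print(f"implied JPEG compression ratio: about {compression_ratio:.0f}:1")
print(f"8 streams use {ethernet_load:.0%} of the network's nominal capacity")
```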

In the next section of this paper, an overview of the system 
architecture is given which covers the configuration of the 
components and the video and audio subsystem. The following 
section then gives an analysis of the performance of the system 
components and defines the limits within which the system 
will operate. The network services and software design are 
described in Section IV. Section V shows the results of the 
measurements that have been made on the system in different 
configurations and Section VI discusses the implications of 
these results. The paper is concluded in the last section. 

II. System Architecture

1. Hardware Configuration

The hardware consists of a client HP Vectra RS25/C PC
equipped with prototype video and audio decompression
hardware¹ connected via standard network interfaces to an HP
9000/400 series workstation server (Fig. 1).

The most simple network configuration for connecting the 
client and the server used a private Ethernet LAN. The only 
other station on this LAN was a network monitoring tool which 
allowed detailed network traffic measurements to be made. 
Further experiments were conducted with the system in the 
presence of other LAN traffic and also with a connection over 
an SMDS wide area network. This last configuration is shown 
in Fig. 2. 

1.A. Video Subsystem

The Videologic prototype video codec was built around
a C³ Microsystems JPEG compression/decompression chip
controlled by an on-card transputer together with a 256 KByte
FIFO for incoming and outgoing data storage. The transputer
controls the interface to the host PC, the setting up of the
compression chip, and the management of the FIFO. The
prototype transputer code allows the user to select the image
size to be coded together with the frame rate and quantization
levels.

Fig. 1. Video Jukebox Hardware Configuration.

¹ Manufactured by Videologic Limited.

The JPEG standard is aimed at continuous tone still image
compression so there is no inter-frame compression. The
addition of inter-frame compression such as in the MPEG
standard would increase the compression by a factor of about
3. While MPEG is targeted at 1.5 Mb/s, there is no inherent
reason that the standard could not be used at the lower bit rates
used in these experiments to obtain higher quality images and
audio. (For a detailed description of JPEG and MPEG, see the
articles in [4].)

Fig. 3 shows the JPEG-coded frame sizes against time for
the trailer sequence of the film 'Buster' which was used in the
experimental sections of this work. The scene changes in the
sequence can be clearly seen. Typical variations with JPEG for
this type of material are of the order 3:1, considerably less
than the 10:1 variations which could be expected using MPEG.
It is important to note that different image sequences will
have quite different characteristics. A sequence that consisted
of a long 'talking-head' shot followed by a steady text frame
would effectively be at two distinct data rates.

1.B. Audio Subsystem

The audio codec is a general purpose DSP chip which is
programmed to implement two channels of 4 bit ADPCM
compression on 16 bit input samples at sampling rates of
either 9.45 kHz, 18.9 kHz or 37.8 kHz. This card also has a
controlling transputer and a 64 Kbyte FIFO which corresponds
to around 6 seconds of audio at a sample rate of 9.45 kHz and
8 bits per stereo sample.

In our experiments we used stereo 4 bit ADPCM at the
lowest sample rate and played the audio back through small
[...] speakers. While this is a highly subjective result, we
judged the quality to be similar to high quality AM radio, and
quite effective in stereo for replaying a film soundtrack.

1.C. Video and Audio Synchronization

VideoLogic's prototype video and audio system was
originally designed to store and replay video and audio from
a local disk. The modifications necessary to run the system
using a remote filestore were carried out at Hewlett-Packard
Labs by the authors. In Videologic's original prototype
stand-alone system there is no explicit synchronization recovery
mechanism. The compressed audio and video data are stored
in separate files. A third file is created during recording and
used for two functions. The first is to index on a frame by
frame basis into the video data file. This is required because
of the variable compressed frame length. The second function
is to retain a list of frame display times associated with each
frame. This is required if, due to host system overheads, it is
not possible to continuously transfer data to the local disc at the
required rate so that recorded frames are no longer compressed
at the nominal rate.

To play back a clip in the stand-alone system, the
decompression subsystem first opens the locally stored files, gets
the audio and video parameters, and then displays the first
video frame in the clip. It then requests a block of audio
data samples, queues them in its local FIFO and starts playing
them back. Requests are then issued for video frames from the
local file system. These are obtained together with the display
time from the index file. The frame and its display time are
transferred, queued in the local FIFO, and then displayed for
the required amount of time.

This scheme inherently makes the assumption that no frames
are lost between recording and playback. If they are,
synchronization between the audio and video channel is lost.
The implications of this in a networked environment are
discussed in Section VI.

1.D. Video Overlay and Capture

The video overlay card is connected to the compression card
via a digital bus that is closely related to the CCIR 601 digital
video standard. Using this bus decompressed video from the
compression card can be chromakeyed into a display window
or alternatively analog video can be frame grabbed, digitized,
and fed to the compression card.

When in capture mode the image size can be scaled before
grabbing to a variety of sizes from 192 x 128 pixels to
352 x 256. We mostly used 256 x 160 which occupies roughly
a quarter of a VGA screen. During playback it is possible
using this card to expand the picture so that the original
256 x 160 image can be blown up to 484 x 320 pixels, which
is roughly 50% of the graphics area. With the very small
viewing distances that are used with computer displays this
picture zooming is not particularly useful: the image defects
are highlighted and are very noticeable. However, at viewing
distances more closely associated with television the picture
looks similar in quality to VHS video.

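
The role of the third (index) file in the synchronization scheme can be sketched as follows. The paper does
not give the file layout, so the IndexEntry fields, the function names, and the millisecond units are all
assumptions made for illustration. The second function shows why the scheme depends on no frames being lost:
a player that simply skips a missing frame shows every later frame too early relative to the audio.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    offset: int       # byte offset of the frame in the video data file
    length: int       # compressed frame length (variable with JPEG)
    display_ms: int   # how long the frame should stay on screen


def build_index(frame_lengths, display_times_ms):
    """Build the per-frame index while recording (hypothetical layout)."""
    index, offset = [], 0
    for length, display_ms in zip(frame_lengths, display_times_ms):
        index.append(IndexEntry(offset, length, display_ms))
        offset += length
    return index


def playback_schedule(index, lost=()):
    """Return (start_ms, frame_number) pairs for a player with no loss recovery."""
    schedule, clock = [], 0
    for number, entry in enumerate(index):
        if number in lost:
            continue                     # frame never shown, clock does not advance
        schedule.append((clock, number))
        clock += entry.display_ms
    return schedule


index = build_index([4000, 6000, 3000], [50, 50, 100])
print(playback_schedule(index))            # frame 2 starts at 100 ms
print(playback_schedule(index, lost={1}))  # frame 2 now starts at 50 ms: video runs ahead of audio
```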

2. Software Configuration

The software architecture is illustrated in Fig. 4. The PC
software runs under Windows 3.0a running in enhanced mode.
The top level user interface program is a Windows program
that resembles a video recorder remote control. Menus allow
dialog boxes to be pulled up which in turn allow a video clip
to be selected and played. Control messages from the user
interface are passed to the main system driver module which
runs as a DOS virtual machine.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, V

This software architecture takes advantage of the ability of
[...]
a token from the decompression subsystem, requests buffer
space from the buffer manager and then, if the decompressor
has room for the data in its receive FIFO, it looks at the top
of the data stream layer to see which type of data is waiting
to be received. The data, which in the case of video must be a
whole frame, is copied into the shared buffer and the token is
handed back to the decompression subsystem with information
on the type and amount of data in the shared buffer. This data
is then asynchronously read, queued, decoded, and displayed
by the decompression subsystem.

The server software operates in user space and, similarly
to the PC, the underlying communications software is HP's
Berkeley sockets implementation. The video and audio files
are stored on the workstation's internal disk as part of its normal
file system. Storage on to a raw disk partition has not been
used in this implementation.

The basic premise of the networked video server is that the
isochronous video and audio data can be read from a remote
disk and transferred to the client over an existing computer
data network at a controlled rate. The important factors in
determining the perceived audio and video quality are the
mean data rate and the data loss rate. Data loss in this context
effectively includes data that arrives too late or out of order.

Data loss can be caused both by bit errors on the network
and by overflowing or emptying of data buffers. Data loss on
the audio channel results in very noticeable clicks and pops.
Fortunately, even though the coding is essentially differential
on a sample by sample basis, the data is packetised into self
contained chunks which can be played without reference to
previous data. Thus loss of a packet does not cause all the
following data to be unplayable.

Data loss on the video channel is not so instantly noticeable.
[...]

The variance of the data arrival at the compression
subsystem is set by a combination of five processes:

• Server data read function.

• Server data transfer function to its network subsystem.

• Network access delay.

• Client network read function.

• Client data transfer function to the decompression
subsystem.

The significance of each of these processes is assessed in
the following sections.
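
The definition of data loss used here (data that arrives too late or out of order counts as lost) can be made
concrete with a short sketch. The paper does not describe its packet format, so the sequence numbers, the
per-packet playout deadlines, and the function name effective_loss are hypothetical.

```python
def effective_loss(arrivals, deadlines):
    """Return the sequence numbers that count as lost.

    arrivals:  list of (sequence_number, arrival_time) in order of arrival
    deadlines: {sequence_number: latest useful arrival time} for every packet sent
    A packet is usable only if it arrives in order and by its deadline;
    packets that never arrive, arrive late, or arrive out of order are lost.
    """
    lost = set(deadlines)          # assume lost until a usable copy arrives
    highest_seen = -1
    for seq, arrival_time in arrivals:
        in_order = seq > highest_seen
        on_time = arrival_time <= deadlines[seq]
        if in_order and on_time:
            lost.discard(seq)
        highest_seen = max(highest_seen, seq)
    return sorted(lost)


deadlines = {0: 10, 1: 20, 2: 30, 3: 40, 4: 50}
arrivals = [(0, 5), (2, 18), (1, 19), (3, 45)]   # 1 is out of order, 3 is late, 4 never arrives
print(effective_loss(arrivals, deadlines))
```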



2. Server Data Read Function 

On the server the system clock can be used to control
the read and data transfer processes to try and maintain an
isochronous data transfer to the network subsystem. The
processes were run in user space on a standard Unix workstation
so the accuracy with which this can be done depends on the
clock itself and on any other operating system activity.

The mechanism used to read the system clock on the server
gave a clock resolution of 0.1 milliseconds. This effectively
sets the lowest limit on the isochronicity of the server read
and transfer process.

The operating system activities can be divided into two
components: those directly related to the user process,
specifically data caching and read time on the file system, and
those entirely independent of the user process.

Data caching by the kernel causes data to be read from
the file system in file system block size units and stored in
memory. As a result file system reads at contiguous locations
within a file will take a variable amount of time depending
on whether all the data is in the cache already and how much
extra data is being read. The caching algorithm employed in
HP-UX 8.0 is quite simple: when the last fragment in a block
is read as part of a sequential read, the next block of data
is also read into the buffer cache. So the last read on a
block incurs the overhead of reading the next complete block.

The read time on the file system is dependent primarily on
the disk performance. Disk seek times at around 16-17 ms
plus 8 ms of rotational latency for 5.25" disks are available
on Hewlett-Packard's current workstations and some
improvements can be expected with smaller disks. The seek, settling,
and rotational delays significantly reduce the disk performance
if multiple streams are read randomly off the same disk. In
the worst case, if the block size is 8 Kbytes and successive
blocks are distributed across the disk, then the throughput of
a 4 MByte/s 5.25" disk falls to ~ 300 KByte/s, which would
only be enough to support a couple of 1 MBit/s isochronous
streams. Increasing the block size to 32 KBytes increases the
throughput to around 1 Mbyte/s, enough for 8 independent
streams. A disk rated at 10 MByte/s with a total seek and
rotational latency of around 12 ms, transferring random 32
KByte blocks, would have a throughput of around 2 Mbyte/s,
sufficient for 16 independent 1 Mbit/s streams. It should be
noted that these figures are for worst case distributions of data
blocks.
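
The worst-case throughput figures above are consistent with a simple model in which every block read pays the
full seek and rotational delay before transferring at the disk's rated speed. The model, the function name,
and the choice of 16.5 ms as the midpoint of the quoted 16-17 ms seek time are assumptions made for this
sketch; the paper states only the resulting numbers.

```python
def worst_case_throughput(block_bytes, rated_bytes_per_s, positioning_s):
    """Bytes per second when every block needs a full seek plus rotational delay."""
    transfer_s = block_bytes / rated_bytes_per_s
    return block_bytes / (positioning_s + transfer_s)


STREAM_BYTES_PER_S = 1_000_000 / 8        # one 1 Mbit/s isochronous stream
POSITIONING_S = 0.0165 + 0.008            # assumed 16.5 ms seek + 8 ms rotational latency

cases = [
    ("8 KByte blocks, 4 MByte/s disk", 8 * 1024, 4e6, POSITIONING_S),
    ("32 KByte blocks, 4 MByte/s disk", 32 * 1024, 4e6, POSITIONING_S),
    ("32 KByte blocks, 10 MByte/s disk", 32 * 1024, 10e6, 0.012),
]
for label, block, rate, positioning in cases:
    throughput = worst_case_throughput(block, rate, positioning)
    streams = int(throughput // STREAM_BYTES_PER_S)
    print(f"{label}: {throughput / 1e3:.0f} KByte/s, about {streams} streams")
```

Under these assumptions the model gives roughly 0.3, 1.0 and 2.1 MByte/s for the three cases, in line with
the figures quoted in the text.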

The application independent factors affecting the server data
read process can be minimized by leaving only the basic
system processes running and ensuring that there is sufficient
memory available on the machine to avoid paging. As the
experimental results will show, these processes are negligible
when compared to the disk performance.

3. Server Data Transfer Function

The server data transfer function is affected by the same 
independent operating system factors as the read process. The 
server dependent activities are as follows: 

• The data copy from the user buffer to a kernel buffer. 

• The protocol processing to convert the user data into a 
packet suitable for transfer to the network card. 

• The queueing time and copying time from the kernel 
buffer to the network card. 

Variable delay can occur in the execution time of the send 
call if either an independent system process is scheduled or if 
the network card has insufficient buffers available to allow the 
transfer. The latter will only occur if previous packets are still 
queued in the card because of excessive network access delays 
or if the required peak data transfer rate cannot be supported 
by the card. 

4. Network Access Delay

The local network available to us for this work was Ethernet
(others are considered in Section IV). For experiments in the
wide area we had access to a point-to-point 1.5 MBit/s SMDS
link via a prototype gateway.

The video server could operate in one of three LAN
environments. The first and simplest would be a private
network with only one server and multiple clients receiving
the same data. The second case would be with two or more
servers on a private network supporting a number of clients.
The final case would be a server or servers on a shared LAN
co-existing with some unknown mixture of other applications.

With only a single transmitter on an Ethernet the network
access delay is effectively zero. While this is not a
particularly startling result it does serve to remind us that
network protocols that allow multicast transmission, do not
use acknowledgements, and do not congest the network with
management activity, can enable a single server to broadcast
video and audio streams to multiple clients with only the
propagation delay to consider.

The behavior of the system with two servers on an Ethernet 
network depends to some extent on the design of the network 
interfaces. If the network interfaces assert an interrupt between 
the transmission of each packet then it is likely that the 
interrupt service time will be longer than the Ethernet inter- 
packet time. Thus two servers would quickly synchronize: one 
defers to the other and then transmits without contention from 
the first. Thus the variability of the delay experienced by each
packet would effectively be restricted to the maximal length
packet transmit time of around 1.2 milliseconds.
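
The figure of around 1.2 milliseconds can be checked with a one-line calculation. The sketch below assumes a 10 Mbit/s Ethernet and the standard maximum frame length of 1518 bytes (excluding the preamble); neither value is stated explicitly in the text above.

```python
ETHERNET_BIT_RATE = 10_000_000   # bits/s, assumed 10 Mbit/s Ethernet
MAX_FRAME_BYTES = 1518           # assumed maximum frame length, without preamble


def transmit_time_ms(frame_bytes, bit_rate=ETHERNET_BIT_RATE):
    """Time in milliseconds to clock a frame of the given length onto the wire."""
    return frame_bytes * 8 * 1000 / bit_rate


print(f"{transmit_time_ms(MAX_FRAME_BYTES):.2f} ms")
```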

The situation becomes almost impossible to predict if the
server(s) have to coexist on a shared LAN. Since Ethernet does
not provide any MAC layer priorities or service guarantees 
it is not possible to separate out the isochronous data from 
the ordinary computer data. If any headway is to be made 
in characterizing the possible performance, some assumptions 
have to be made on all of the following: 

• Distribution and number of machines on the network. 

• Packet length distribution from each machine. 

• Data load on each machine and the impact of higher level 
flow-control and buffering. 

Some observations can be made. In many environments the 
long term average LAN utilization is of the order of just a few 
percent. Significant increases in the utilization occur during the 
course of the day where the minute by minute average might 
peak at 5-10%. In the very short term, i.e. on a second by
second basis, very high peaks corresponding to file transfers
may occur. In a laboratory environment our own measurements
and those shown in [6] indicate that packet distributions are
bimodal with the peaks towards the shortest and at the longest
packet lengths.

For the video server the most important measure of the 
network is the variability of the packet delay. An extensive 
experimental study [7] conducted measurements on the stan- 
dard deviation of the packet transmission delay for bimodal 
packet length distributions on a network comprising two sets 
of clustered hosts. This study indicates that for a balanced 
load the delay variation increases roughly linearly with the 
number of hosts going from 5 to 25. For a packet length 
distribution of 6 maximal length packets for every 2 minimal 
length and 20 hosts in 2 clusters on a 2000 foot Ethernet, 
the average transmission delay was 20 milliseconds with a 
standard deviation of 60 milliseconds. These delays were 
incurred with the Ethernet utilization at 9.3 MBit/s. With just
five hosts the average was around 5 milliseconds, with a 
standard deviation of 20 milliseconds at a similar average 
network utilization. 

5. Client Network Read Process

The client network read process comprises the servicing of
incoming packet interrupts from the network, the transfer of
the data to the host, and the protocol processing. The packet
interrupt service time on the PC should be constant. The
variability comes from the process of reading from the network
subsystem (i.e. the top of the machine's protocol stack) and
copying the data into application memory. The client machine
is a PC running Windows 3.0a in enhanced mode. The network
read process, and the data transfer process, are implemented
in a DOS virtual machine. In enhanced mode Windows time
slices between the DOS virtual machines that are running and
Windows itself. The length of the time-slice can be adjusted as
can the effective number of time-slices that the virtual machine
has. A balance needs to be struck between the throughput that
the transfer process can achieve and the responsiveness of the
top level windows application.

The variation in the delay incurred in the transfer process
from the network card will be a function of the time-slice
period. In our experiments we varied the time-slice between
10 and 60 milliseconds and changed the ratio of the number
of time slices each process received from 1:1 to 5000:1.

6. Client Data Transfer Process

The client data transfer process takes data that has been
read from the network subsystem and, in conjunction with
the compression subsystem processor, transfers this data into
the FIFOs on the compression cards. The basic mechanism
is triggered by receiving data off the network. The data is
copied as described above, and the transfer process notifies
the compression subsystem. If there is sufficient space on the
card FIFOs a message is returned and the transfer process copies
the data in to a shared memory area on the PC. This data is then
asynchronously collected and removed by the compression
card processor.

The client process is part of the same virtual machine that is
described above and is therefore time sliced. The collection of
data by the compression card processor is interrupt driven and
therefore should have relatively little impact on the variability
of the data transfer delay.

7. Summary

From the above analysis only the following appear to be
able to cause significant variation in the delay time of the data
transferred from the server to the client:

• Server disk read times (5-25 ms variations).

• Network access delay on a very heavily loaded Ethernet
(5-80 ms variations).

• Operating system time slicing on the PC (10-100 ms
variations).

The worst case variation in data arrival times that
should be seen at the receive FIFOs on the decompression
subsystem will be around 20-200 ms, depending on precisely
how the system is set up and in what network environment [...].

Since the distribution of actual delay times is unknown
it is not possible to accurately calculate a given FIFO size
for a given probability of the FIFO emptying or overfilling.
However these figures imply that we should be looking to
queue between 50 and 500 milliseconds of data depending on
the environment in which we are operating. At a nominal data
rate of 1 MBit/s this corresponds to roughly 6-60 Kbytes of
data.

With the video jukebox application, a start up delay before
playing a clip of 0.5-1 seconds is not objectionable. This
should allow us to pre-fill the receive FIFOs sufficiently
to smooth out the received data delay variations. In the
experiments reported here the delay was set at 500 ms.

IV. Protocols

There are two distinct sets of requirements for protocols for
multimedia applications. There is a need for signalling, both
between hosts and within the network, and also data transport.
End-to-end signalling allows hosts to exchange information
for the control of the different components of an application,
while host-network and intra-network signalling is required
to establish and maintain the connections for information
exchange. The requirements for data transport are the same
as those for many other applications, such as file transfer
or electronic mail, but there is an additional requirement to
exchange data on a time scale that is determined by the
application rather than by the capabilities of the underlying
system.

In line with the philosophy of building a system based on
off-the-shelf components, we have attempted to use existing
protocols, as provided, where possible. However, there is
currently no standard for signalling for multimedia systems,
and no off-the-shelf implementation of a protocol for real-time
data transport. In both cases, these are active areas of
research and development. Signalling for multimedia services
is addressed in [8] for example, while problems and solutions
for connection management are described in [9]. For a
discussion of the different approaches to real-time data transport,
see [10]. For the system described in this paper, we have
designed and implemented our own solutions for signalling
and data transport. In each case, the protocols are implemented
as software processes that run on top of standard Internet
protocols, thereby maintaining a standard interface to the
underlying network. The remainder of this section describes
the design and implementation of these protocols.

1. Architecture

The protocol architecture comprises the data
stack and the signalling stack, as illustrated in Fig. 5. The data
stack is responsible for sending and receiving data in real
time. The signalling stack is responsible for application
services and connection management.

For real-time data [...] at a rate that is [...] of the [...]
layer interfaces to the file system for [...]
input. To optimally match the data rate and the size of
data units at the presentation layer to those of the underlying
network, the segmentation layer may repackage the data as
appropriate. A receiver is required to receive the data and
reassemble the data to reconstitute the original data stream.
Here, the transport interfaces directly with the video and audio
subsystem.

Fig. 5. Protocol Architecture.

The signalling stack comprises the services for user inter- 
action between the client and server, and those for call and 
connection management. The application services provide the 
remote interaction between the commands available to a user 
of the system and the corresponding operations on the server. 
Simple examples of these services are play and stop. The call
management interfaces between the total network requirements
of the application, and the underlying connection layer. The
connection layer is responsible for the control of individual 
connections between the hosts. 

The services provided in each stack use the standard IP
protocol, together with either TCP or UDP, for the underlying
network and transport services. The signalling stack requires
reliable end-to-end communication and therefore uses TCP
connections for the transfer of control information. For real-time
data transfer, functions such as reliability and flow control
have been left to higher layers. Hence this stack uses UDP
datagrams. The link and physical layer services are used as
provided, according to the connectivity available. To date, we
have used 802.3 and SMDS, but others are possible.

By maintaining a standard interface between the data and 
signalling stacks and the underlying transport layer, we gain 
the advantage of complete portability among a large number
of systems. However, there are two disadvantages. Firstly, this
approach does not lead to the most efficient implementation.
Secondly, it is difficult to incorporate any notion of priority in
the system, either between the two stacks or among different
connections in a single stack. Although neither of the link-layer 
protocols that have been used incorporate a model of priority, 
there are others that do. In these cases, it would be difficult to 
exploit this facility. This is discussed further in Section V. 

2. Services

A brief description of the services provided at each layer of
the protocol stacks is given here.

At the top layer of the signalling stack, the application layer, 
the following services are provided. 



• init_NVCR(server_name). Before any other services are 
used, this is necessary to initiate a control connection to 
the server. 

• end_NVCR. This closes the control connection.

• play(clip_name). To start playing a clip comprising video
and/or audio.

• stop(clip_name). This service is terminal in that it ends
the session for this particular clip, and releases all
associated resources.

• pause(clip_name). To freeze the specific clip. All of the
resources associated with the clip are retained.

• resume(clip_name). To continue a clip from the point at
which it was paused.

To play or stop a clip, the application layer employs the 
services of the call layer. The following services are provided 
at this layer. 

• create_call(channel[n]). The call comprises all of the
channels required by the application, with all relevant 
details describing the channels, such as the quality of 
service of each component. 

• close_call. 

The protocols employed by the call layer to establish the 
end-to-end communication requirements of the application 
are dependent on the underlying network. For simple con- 
figurations, the call layer itself is sufficient to control this 
communication. In the general case, the call layer can make 
use of the connection layer to establish and control connections 
individually. This layer is based on the Internet ST protocol 
[11]. Currently, only the following services are provided by
the connection layer. 

• open_connection. 

• close_connection. 

Other services defined in ST include those related to the 
modification of a connection, either in terms of the number 
of targets or the quality of service. These are not used in the 
system described in this paper, as the application is point-to-point
and none of the networks used provide quality-of-service
guarantees.

The signalling protocols operate out-of-band from the data. 
Within a host, the interface between the two stacks lies 
between the application-level signalling and the real-time 
transport. At this interface, the application can invoke services 
in the data stack. For a transmitting host, the transport layer 
offers the following services. 

• start_tx(channel_id, source_id, rate). To initiate
transmission of data from the source specified, which is a file
descriptor here, on the given channel, which is already
established and again is specified as a file descriptor. The
rate specification can be absolute, in terms of bytes/s,
or can be a continuous function specified in a file. The
former case is applicable to audio, where the data rate is
constant, while the latter is applicable to video where the
data rate is continuously variable during the clip.

• stop_tx(channel_id). To cease transmission and release
the resources associated with this transmission at the
transport layer.

• pause_tx(channel_id). To stop transmitting while
retaining the channel resources.

• resume_tx(channel_id). To continue transmission from
the point at which a pause was issued.

The complementary receive services are provided by the
transport layer on a receiving host. One major difference
between the two is that the receiver has no direct control over
the data rates so, for a loss-less service, the receiver transport
must be able to receive the data at the rate that has been 
negotiated by the signalling protocols during the set-up of the 
session. It is a matter for the call-layer protocols to ensure that 
there are sufficient resources at each end to satisfy the quality 
of service requested for the clip. 

3. Design 

The protocols are designed as a set of asynchronously com- 
municating finite state machines. Each protocol is specified as 
a data structure comprising an array of states, each state having 
an array of triples of the form (event, next_state, action). The
data structures are interpreted by a state machine executor 
(SME). The SME is triggered by events that occur in the 
system. There are four sources of event in the system. 

• Commands from the user.

• Information received from the network.

• Events generated internally during state machine
transitions.

• Internally generated time-outs.

The system receives these events in different ways. For 
example, events resulting from user commands are received 
via an interface to the windows application while events from 
the network occur via packets received from the network 
interface. In all cases the event is translated into a standard 
format known as a signal. These signals are placed in a single 
FIFO queue, which is read by the SME. 
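
As an illustration of this table-driven design, the sketch below shows a minimal state machine executor reading signals from a single FIFO queue. It is not the implementation described in this paper: the states, events and actions are hypothetical, and each state's array of (event, next_state, action) triples is represented as a dictionary keyed by event for brevity.

```python
from collections import deque


def start_playing(log):
    log.append("start transmission")


def stop_playing(log):
    log.append("stop transmission")


# Hypothetical protocol table: state -> {event: (next_state, action)}.
PROTOCOL = {
    "idle": {"play": ("playing", start_playing)},
    "playing": {"pause": ("paused", None), "stop": ("idle", stop_playing)},
    "paused": {"resume": ("playing", None), "stop": ("idle", stop_playing)},
}


class StateMachineExecutor:
    """Interprets a protocol table, servicing one queued signal per step."""

    def __init__(self, protocol, initial_state):
        self.protocol = protocol
        self.state = initial_state
        self.signals = deque()   # the single FIFO queue of signals
        self.log = []

    def post(self, event):
        """Translate an event into a signal and place it on the queue."""
        self.signals.append(event)

    def step(self):
        """Service at most one signal; return False if the queue was empty."""
        if not self.signals:
            return False
        event = self.signals.popleft()
        transition = self.protocol[self.state].get(event)
        if transition is not None:   # events invalid in this state are ignored
            next_state, action = transition
            if action is not None:
                action(self.log)
            self.state = next_state
        return True


sme = StateMachineExecutor(PROTOCOL, "idle")
for event in ("play", "pause", "resume", "stop"):
    sme.post(event)
while sme.step():
    pass
print(sme.state, sme.log)
```

Because the protocol is pure data, a new state or event can be added by editing the table alone, which is the regularity the design aims for.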

This design has a regular structure which allows the proto- 
cols to be easily modified and extended. However, it does not 
lead to the most efficient implementation in terms of execution 
time. For this reason, the control path, which includes all of the 
signalling stack and the interface between this stack and the
data stack, has been designed in the way described so far, but
some of the indirection has been by-passed in the design of the
real-time data path. Once a data channel has been established, 
there is a direct path of execution from the data source to the 
network interface on the sending host. Similarly, there is a 
direct path from the network interface to the video and audio 
subsystem on the receiving host. Experience with the working 
system has shown us that this decision was the correct one 
in order to achieve the required data throughput throughout 
the system. 

For convenience, a menu containing a collection of useful 
operations has been incorporated into the system. This is 
usually dormant, but can be activated from the keyboard at any 
time. The facilities provided from the menu include the ability 
for an operator to examine and manipulate the signal queue, 
and to examine statistics that are collected during system 
operation. 



make control connection;
repeat
    if ((event from windows interface) or
        (event from network control interface) or
        (event from internal timers))
        translate event to signal;
        place signal on queue;
    if (signal on queue)
        remove signal;
        invoke SME with signal;
    for (each active i/p data channel)
        if (there is any data to be read)
            read the data from network channel;
            pass the data to VA subsystem;
    update clock;
    for (each active o/p data channel)
        if (it is time to send some data)
            read the data from source file;
            send the data to network channel;
    if (keyboard input)
        invoke menu;

Fig. 6. Structure of Protocol Software. 



4. Implementation 

The system is implemented as a single user process in C. 
The underlying network system is built from BSD sockets. 
The software has been written so that it can operate either as 
a client or a server, with minimal changes. The few changes 
required are controlled by compile-time options. By minimizing
dependencies on system calls, the code is totally portable
between either Unix or DOS based machines, assuming the
presence of a sockets library for DOS.

An overview of the implementation is given by the pseudo- 
code in Fig. 6. On start-up, the system first attempts to make 
the control connection. If it is operating as a server, it waits 
indefinitely for the client to make contact. The client will 
do this in response to a user command from the windows 
application. Once the connection is established, the main 
control cycle is entered. Here, the system cycles a loop in
which it alternately generates and services control events and
then services the set of data paths that have been established.
Thus events generated from the sources listed in Section IV.3
are detected and translated to signals on the signal queue. A
signal may then be removed from the queue and used to invoke
the SME. Following this, the data path is serviced. The set
of active data input channels is checked for incoming data.
Any data present is read and transferred to the video/audio
subsystem. The data output channels are then serviced. The
elapsed time since the last cycle is checked against the transmission
period for each channel. If it is time to send data, it is read from
the source file and sent across the network. The final action
is to invoke the facilities menu if an operator has requested 
it from the keyboard. 
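
The servicing of the output channels can be sketched as follows. This is an illustrative simulation rather than the code of the system: the class and function names are hypothetical, a fixed cost of 1 ms per pass through the cycle is assumed, and the periods of 50 ms and 88 ms are taken from the nominal video and audio inter-packet times of clip q2s2f20.

```python
class OutputChannel:
    """Hypothetical output channel sending one data unit every period_ms."""

    def __init__(self, period_ms):
        self.period_ms = period_ms
        self.next_send_ms = 0
        self.sent_at_ms = []

    def service(self, now_ms):
        """Send one unit if the transmission time for this channel has arrived."""
        if now_ms >= self.next_send_ms:
            self.sent_at_ms.append(now_ms)
            # Schedule from the nominal time so that lateness does not accumulate.
            self.next_send_ms += self.period_ms
            return True
        return False


def run_cycles(channels, cycle_ms, duration_ms):
    """Simulate the control cycle, assuming a fixed time cost per pass."""
    for now_ms in range(0, duration_ms, cycle_ms):
        for channel in channels:
            channel.service(now_ms)


video = OutputChannel(period_ms=50)
audio = OutputChannel(period_ms=88)
run_cycles([video, audio], cycle_ms=1, duration_ms=1000)
print(len(video.sent_at_ms), len(audio.sent_at_ms))
```

Advancing the next send time by the period, rather than resetting it from the current clock, keeps the mean rate at its nominal value even when an individual pass through the cycle is late.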

It is important that only one event is serviced during each
pass through the cycle. The asynchronous communication
between the state machines ensures that it is not possible for
a sequence of control actions to occupy the CPU contiguously
at any time. In this way, the execution of the data path has
some priority, and the maximum possible data throughput [...]

V. Experimental Results

To test the system, a clip of approximately 60 seconds 
duration, recorded from a film on a laser disk, was played 
back from the server onto the client. Measurements were taken 
on the client and the server, and also on a network analyzer 
attached to the network. A subjective assessment of the quality 
of the play-back was made through observation of the client. 

The recording process allows us to vary the following
parameters.

• Video compression factor. 

• Audio compression factor. 

• Video frame rate. 

• Video picture size. 

• Number of audio channels (mono/stereo). 

Since the parameters related to the video component have a 
more significant impact on the data rates required, a number
of recordings of the same clip were made with different values
for these parameters but with the same audio settings of stereo
with 4 bits/sample on each channel at a sampling rate of 9.45
kHz. The set of clips used are described by the parameters in
Table I, which also shows the data rates associated with each 
clip. Each clip is identified by a name which is constructed 
as (compression_factor:frame_size:frame_rate), as defined in 
the table footnotes. 
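
The data rates in Table I can be cross-checked against the recording parameters. The sketch below is illustrative: the function names are hypothetical, and it assumes that the "Mean" column of Table I is the mean compressed frame size in bytes, an interpretation that reproduces the tabulated video rates.

```python
import re


def parse_clip_id(clip_id):
    """Split an id such as 'q2s2f20' into (compression, size, frame rate) indices."""
    match = re.fullmatch(r"q(\d+)s(\d+)f(\d+)", clip_id)
    if match is None:
        raise ValueError(f"not a clip id: {clip_id!r}")
    return tuple(int(group) for group in match.groups())


def video_rate_kbit(mean_frame_bytes, frames_per_s):
    """Mean video rate in Kbit/s for a given mean frame size and frame rate."""
    return mean_frame_bytes * 8 * frames_per_s / 1000


def audio_rate_kbit(channels=2, bits_per_sample=4, sample_rate_hz=9450):
    """Audio rate in Kbit/s; the defaults are the settings used for all clips."""
    return channels * bits_per_sample * sample_rate_hz / 1000


# Mean frame sizes as listed in Table I (assumed to be in bytes).
mean_frame_bytes = {"q2s2f20": 4418, "q2s2f25": 4416,
                    "q2s3f20": 5614, "q1s2f20": 6381}

for clip_id, size in mean_frame_bytes.items():
    _, _, frame_rate = parse_clip_id(clip_id)
    print(clip_id, round(video_rate_kbit(size, frame_rate)),
          round(audio_rate_kbit()))
```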

Detailed timing measurements were taken on the network
and the client. The network measurements gave an accurate
time stamp for each packet on the Ethernet, together with
sufficient information to determine to which video frame a
set of packets relate. The client measurements recorded the
time at which each reassembled frame was transferred to the
decompression subsystem. The quality of the clip was also
assessed subjectively, with attention to frame loss or audio
drop-outs, and observable synchronization between video and
audio.

For local networking, tests were carried out using a single 
Ethernet segment. Initially, a private segment, on which no 
other traffic is present, was used. The restriction of using an 
unloaded network was then relaxed by performing the same 
test over a segment of the main Ethernet shared throughout 
the building. To study the effect of using a network that is not 
point-to-point, the final tests were carried out with the server
attached to a T1 SMDS link, providing communication with
the client through an SMDS-802.3 router, again using a shared
Ethernet segment in the local area.

In the space available here, we refer to an examination of
the performance of the system using a single clip, q2s2f20,
from Table I. In summary, clip q2s2f20 could be successfully
played back over each of the three network configurations
without any data loss. Subjectively, the quality was judged to
be as good as a play back from a local disk, with acceptable
synchronization between the video and audio streams. Where
the network configuration incorporated the shared Ethernet,
this performance was dependent on the volume of other traffic
on the segment during play back. Similarly, for the SMDS test,
there would be a dependency on the other traffic through the 
router but for this test no other traffic was allowed as the data 
rates of the clip were very close to the capacity of the router. 



TABLE I

Clip id (a, b, c)   Mean frame size   Video rate   Audio rate
                    (bytes)           (Kbit/s)     (Kbit/s)

q2s2f20             4418              707          76
q2s2f25             4416              883          76
q2s3f20             5614              898          76
q1s2f20             6381              1021         76

a q(n) is the compression factor: for this clip, q2 = 0.86 bits/pixel and
q1 = 1.25 bits/pixel.
b s2 = 256 x 160, s3 = 288 x 192.
c f(n) = n frame/s.








TABLE II

              Audio inter-packet time/s     Video inter-packet time/s
Network       Nominal   Mean    Var.        Nominal   Mean    Var.

Private LAN   0.088     0.092   3.2e-5      0.050     0.050   1.?e-5
Shared LAN    0.088     0.091   2.9e-5      0.050     0.050   1.3e-5
LAN/WAN       0.088     0.093   3.4e-4      0.050     0.051   4.4e-5


TABLE III

Client Inter Frame Times/s

Network       Nominal   Mean     Var.

Private LAN   0.050     0.0496   5.3e-5
Shared LAN    0.050     0.0496   5.3e-5



Measurements on the network monitor showed that the 
packet dispersion on the network matched very closely the 
nominal rates at which the real-time transport was attempting 
to send data. In all cases there was some jitter about this 
nominal value, as summarized in Table II. The corresponding 
measurements taken on the client are summarized in Table III. 
These show a considerable increase in the delay jitter when 
compared to the network measurements. 
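
To make the comparison concrete, the variances can be converted to standard deviations in milliseconds. The sketch below uses three values from Tables II and III and assumes that the variances are expressed in seconds squared; the function name is hypothetical.

```python
import math


def jitter_ms(variance_s2):
    """Standard deviation in milliseconds for a variance given in seconds squared."""
    return math.sqrt(variance_s2) * 1000


# Variances taken from Tables II and III.
measurements = {
    "shared LAN, video packets on the network": 1.3e-5,
    "shared LAN, frames at the client": 5.3e-5,
    "LAN/WAN, audio packets on the network": 3.4e-4,
}

for label, variance in measurements.items():
    print(f"{label}: {jitter_ms(variance):.1f} ms")
```

On the shared LAN this gives a standard deviation of about 3.6 ms for video packets on the network and about 7.3 ms for frames at the client, roughly a doubling.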

For a 60 second clip, we ensured that this jitter had no 
degrading effect on the play back by incorporating a delay 
between the arrival of the first data packet on the client and the 
initiation of the decompression subsystem. This delay allowed 
the buffer occupancies to reach a level at which the subsequent 
jitter did not result in data being absent when required by the 
decompression hardware. 

The detailed results of the tests are reported in Sections 
V.1-V.3 below. Again, we concentrate on an examination of 
the single clip q2s2f20. 


1. The Private Point-to-Point Link 

The dispersion of the audio and video packets for a single 
clip on the private Ethernet segment is shown in Figs. 7 and 
8. These graphs show that the packets on each channel are 
sent according to the nominal rate control with some jitter 
accumulated between the sending transport process and the 
physical network. On both the video and audio channels, this 
jitter is characterized by mainly small, positive and negative,
deviations about a mean value which is very close to the [...]

Fig. 13. Ether[...] for Unsuccessful Play Back.

Fig. 14. Dispersion of Audio Packets o[...]

3. LAN/WAN Connectivity 

Figs. 14 and 15 show the dispersion of the audio and video 
packets on the Ethernet segment during a successful play 
back across the Ethernet/SMDS combination. For this test, the
Ethernet is the same shared segment employed in the previous 
test, and was used during a period of light load. The mean 
and variance of the inter-packet times on these graphs are 
given in Table II. The graphs clearly show increased jitter on 
both channels, compared to that accumulated on a point-to- 
point link. The inter-packet variance is an order of magnitude 
greater. 

Although this test was successful, it is subject to the same 
restrictions as the shared LAN alone, and also to the load on 
the router. The presence of the router adds to the sensitivity 
of the play back as the maximum throughput of the router, an 
early experimental prototype, is approximately 1 Mbit/s, which
is very close to the bandwidth required for the clip. Hence
this test was run with no other traffic permitted through the
router, and the results only show that it is possible to run the
jukebox over the existing SMDS/Ethernet combination under
a controlled scenario. The impact of the router on the packet 
dispersion is significant even in the absence of any other load 
on it. As before, for the 1 minute clip, the pre-buffering is 
sufficient to absorb this level of jitter. 



Fig. 15. Dispersion of Audio Packets over Ethernet/SMDS combination.

VI. Discussion
The results of the experiments that have been conducted
show that applications with real-time requirements, such as
the Video Jukebox, can be supported on standard currently
available equipment. The Video Jukebox can be used in local
stand-alone operation with a local disk, or as a networked
service from a remote server which has the potential to support
multiple channels and multiple clients concurrently. From a
user's point of view, the change from a local to a networked
service is seamless. In this section, we examine some of
the implications of our results, and examine some of the
limitations of existing systems.

1. Networking Multi-Media Applications

There is no doubt that the tremendous decrease in the cost
of compression technology is leading to the wide introduction
of video systems; forecasts predict an increase in the video
conferencing equipment market from around $150 million
today to anywhere from $240 million to $1 billion by 1995
[12]. There are many applications for which we consider
our system to be adequate. Over the last two years there
has been a rapid move towards using multimedia material in
training, particularly where large numbers of people need to be
trained rapidly. Current systems use either video tape or laser
disc, with some moving towards compact disc based digital
systems (either DVI [13] or CD-I). There are advantages
and disadvantages with all these systems. Tape is not suited
to random access; laserdisc and CD cannot be easily edited
and, to be cost effective, need to be produced in large
quantities. There is certainly a place for a video server for
classroom based teaching which stores random access, editable
multimedia material. Organizations such as airlines need to
train large numbers of cabin crew quickly on basic customer
service principles and would benefit from a classroom in which
multimedia material could be easily used.

One step removed from the system developed here is a
two way video conference and many people have proposed
that such a service could have wide applicability. In particular
remote support, maintenance, and monitoring of anything from
power stations to aeroplanes is an interesting area. Airlines
have a requirement for such a system to help reduce aircraft
downtime caused by unscheduled maintenance at airports
where they do not have a service depot.

There are three main factors that need to be addressed in
extending our prototype into a robust and versatile system.

• Presence of other traffic, particularly over a WAN. 

• Multiple clients using a single server. 

• Extension to applications having more stringent real-time 
demands, such as two-way video communication. 

These have consequences for the transport and network 
layers, as described below. 

A. Video and Audio Transport

Real-time data transport is essential to support video and
audio streams. One of our initial experiments with the Video
Jukebox used a request-response model for data transport, 
based on TCP. In this system, the client would request data 
from the server at times initiated by the codec, and the server 
would respond by sending this data over a TCP connection. 
This approach was successful in very limited circumstances, 
but generally exhibited poor performance. By placing the 
source of timing in the server and using this to drive a 
rate-controlled transport layer, we can achieve more accurate
timing and higher throughput. 

For a system like the Video Jukebox, the existing protocols 
will be adequate for operation in many situations. Consid- 
eration of the factors listed above shows that there are two 
areas where we could usefully enhance the existing protocol. 
Firstly, the rate control should be extended to operate in a 
burst mode. It is not possible in practice to send data at a 
constant mean rate with total precision. Inaccuracies occur
from the source timing itself, and are also introduced by other
variable delays in the server. The jitter resulting from the disk 
access on the server for a single clip could be minimized, 
by contiguous file placement for example, but this would 
reappear if the server had multiple clients viewing different 
clips simultaneously. A file system that is designed to support 
real-time data streams, as described in [14] for example, will 
help, but it is not possible to eliminate jitter throughout the
system, so the rate control should be extended so that
delays can be accommodated. It is difficult for the
transport to accommodate jitter within the network, which
is addressed in the next section, but for the other sources,
a variable transport data rate can absorb the jitter. The rate
specification should include a nominal rate, corresponding to
the mean data rate of the channel, with the addition of a
higher rate that can be used periodically when the actual data 
rate on the channel has not matched this nominal rate. (Note 
that a more adaptive rate control mechanism may also be 
appropriate to support data sources that are more bursty by
nature. MPEG-compressed video could be an example of this
type of source.)
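
As a rough illustration of the two-level rate specification described above, the following Python sketch chooses between a nominal rate and a higher catch-up rate. The names `RateSpec` and `choose_rate`, the example rates, and the rule of comparing bytes sent against the nominal schedule are assumptions made for this example. They are not taken from the jukebox implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateSpec:
    nominal_bps: int  # mean data rate of the channel, in bits per second
    burst_bps: int    # higher rate, used only to catch up after a delay


def choose_rate(spec: RateSpec, elapsed_s: float, bytes_sent: int) -> int:
    """Return the rate (bits/s) to use for the next sending interval."""
    # Bytes that should have been sent by now at the nominal rate.
    scheduled_bytes = spec.nominal_bps * elapsed_s / 8
    if bytes_sent < scheduled_bytes:
        # The channel has fallen behind (for example after a slow disk
        # access), so switch to the higher rate until it catches up.
        return spec.burst_bps
    return spec.nominal_bps


# Hypothetical figures: a 1 Mbit/s stream with a 1.5 Mbit/s catch-up rate.
spec = RateSpec(nominal_bps=1_000_000, burst_bps=1_500_000)
on_time = choose_rate(spec, elapsed_s=1.0, bytes_sent=125_000)
behind = choose_rate(spec, elapsed_s=1.0, bytes_sent=100_000)
```

Here `on_time` is the nominal rate because the sender is exactly on schedule, while `behind` is the burst rate because 25,000 bytes are still owed.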

Knowing when to switch data rates is the second area 
for enhancement. In a limited well-controlled environment,
where the types of delay expected are known in advance,
the switching period can be set a priori. This was in fact
done on the video jukebox to achieve the same effect as 
the pre-buffering employed. This approach is fragile as it 
takes no account of the actual behavior of the channel on 
the network and in the client. A more satisfactory solution is 
to incorporate feedback from the client to the server giving 
the server knowledge of the actual data rates being achieved 
and the level of buffer occupancy. The transport can then 
adjust dynamically to maintain the real-time requirements of 
the channel. This type of facility is analogous to window-based 
flow control such as that employed in TCP, but with the timing
under the direct control of the source, and feedback provided 
only when adjustments are required to this timing. 
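
The feedback scheme in this paragraph can be sketched as follows, under stated assumptions. The client reports only when its buffer occupancy leaves a comfortable band, and the server, which keeps control of the timing, nudges its rate in response. The thresholds, step size, rate limits, and message names below are all hypothetical values chosen for the example.

```python
from typing import Optional


def client_feedback(buffer_fill: float, low: float = 0.25,
                    high: float = 0.75) -> Optional[str]:
    """Client side: report only when occupancy leaves the band [low, high]."""
    if buffer_fill < low:
        return "faster"
    if buffer_fill > high:
        return "slower"
    return None  # timing is acceptable, so no feedback is sent


def adjust_rate(rate_bps: int, feedback: Optional[str],
                step_bps: int = 50_000, min_bps: int = 500_000,
                max_bps: int = 1_500_000) -> int:
    """Server side: the source owns the timing and adjusts it on request."""
    if feedback == "faster":
        return min(rate_bps + step_bps, max_bps)
    if feedback == "slower":
        return max(rate_bps - step_bps, min_bps)
    return rate_bps


# Replay a short, made-up trace of client buffer occupancy readings.
rate = 1_000_000
for fill in (0.5, 0.2, 0.1, 0.9):
    rate = adjust_rate(rate, client_feedback(fill))
```

Unlike a window-based scheme, no message is exchanged while the occupancy stays inside the band, which matches the idea of sending feedback only when the timing needs adjusting.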

We also need to reconsider the requirements in order to 
support multi-media applications other than the video jukebox. 
Meeting the real-time requirements of the jukebox is eased 
by the fact that there is no interaction between client and 
server on the data paths. This gives us the scope to use pre- 
buffering on the client to overcome jitter on the data channels. 
Some applications, such as multi-media conferencing, are 
multi-way and interactive and would therefore not permit this 
technique to be used. Those applications will place much more 
stringent demands on the network layer to meet the real-time 
requirements, as discussed further below. 

These extensions to the transport layer mean that the pro- 
tocol incorporates many of the features contained in some of 
the proposals that have emerged for a new standard transport 
layer in future high-speed networks. Examples include XTP 
and VMTP. Some of these protocols also include support for 
multicast, which is a useful feature for applications requiring 
multipoint distribution. There is currently no consensus on
the relative merits of these candidates, and little practical 
experience in using them on real networks, but it is likely that 
one or more of these protocols will form an integral component 
of future networked multimedia systems. 

B. Networks for Video and Audio

A more adaptive transport protocol of the type described 
above may not be sufficient to maintain real-time channels in a 
general network environment, particularly where the channels 
have harder real-time demands than those for the jukebox. 
Jitter due to the network is not easy to predict, particularly 
over networks that are not point-to-point, and especially over 
large WANs. Reports on the load on the Internet, for example, 
show highly variable levels of congestion with corresponding 
variations in delay. In the general case where a user wishes to 
operate any time at any location, it highlights the requirement 
for support for real-time data at the network level in addition 
to higher levels. There are two ways of by-passing this 
problem. 

• Exclude other traffic from the network. Dedicating a 
segment of the network to the real-time application is 
a short-term solution that will enable a limited number of 
users to operate. 

• Use a network that has some support for synchronous 
channels. FDDI is one candidate. ISDN could be used 
for connection over the wider area.

The first approach may be an acceptable way of introducing
multi-media applications such as the video jukebox to
selected users, and for gaining exposure for the benefits of
these systems. It has obvious limitations in terms of cost,
convenience, and connectivity. The second approach provides
a number of potential platforms, particularly in the wide
area. It is currently unclear in which direction the public
service providers will move, and it is certain that developments
here will be influenced as much by political and economic
factors as they will by technical considerations. See [15] for
further insight into the options available here. Our interest
in this paper is bringing real-time services to computer users
now, so we focus on the use of the large existing base of
conventional packet-mode networks. The Internet, for example,
[...] connectivity, and is still expanding.
[...] packet-mode operation has the attraction of economy
[...] isochronous
channels are not necessarily appropriate anyway, and can lead
to inefficient use of network resources. With extensions to the
network layer, existing packet networks can be used to support
real-time channels without the restrictions imposed during our
tests. Enhancements are required in two areas.

• Resource reservation.

• Dynamic resource control.

The requirement for reservation is [...]
on the Ethernet/SMDS network. In this case, [...]
required by a clip is close to the total capacity of the router. It
is necessary to negotiate with components such as the router
to claim the resources required during a session, and to incorporate
resource control mechanisms within these components
to ensure that a quality-of-service that is negotiated is indeed
fulfilled. A signalling protocol for this purpose is defined by
the existing ST-II standard, and this has been incorporated
[...] architecture of the video jukebox. However,
there are currently few implementations in commercially
available routers of the resource control required at the IP
[...]

[...] use of a shared Ethernet for local connectivity has
some limitations. The non-deterministic access in Ethernet
means that users will always be subject to some loss of data.
Alternative technologies, such as Token Ring, provide more
deterministic performance and therefore may be considered a
more suitable base for this application. However, as already
noted, the characteristics of the video traffic are not necessarily
suited to this type of access. Further, some data loss is
tolerable, and the nature of the MAC protocol in Ethernet
is such that rather than losing a lot of data, it is more
common that data is just delayed during periods of high
load. With a more adaptive transport protocol, and perhaps
with larger buffers on the client, it is likely that a fair
degree of tolerance to this delay can be built into the
[...]cessful playback on the shared Ethernet is not an inherent
limitation of the Ethernet, but a result of some specific details
of the implementation of the networking software in the client.
A "lost" Ethernet frame, which results in a partially assembled
UDP packet in the client, together with the implementation of
the timing in the PC network software, can lead to a temporary
deadlock of the type observed during unsuccessful play-back.
The required behavior, for this application, in the presence
of an unreliable link layer is for the client to [...]
within a very short time (determined by the application) [...]
cannot be successfully received, so that subsequent [...]
frames can be received. We would expect to observe no more
than occasional loss of a video frame in this case. [...]
the software available to us, we were unable to experiment
with such changes. This is an example of where specific
details of protocol implementation, targeted at conventional
data transfer, are not always appropriate for exchange of real-
[...]

The efficient multiplexing of traffic on these networks provides
the potential for bandwidth on demand, and widens the [...]
for networked multimedia. (The articles in [16] give a good
overview of future trends in multimedia communication.)
These networks however introduce difficulties of their own,
in that the end-to-end protocols have to handle periodic cell
loss. Note also that high-speed networks in general present a
different performance characteristic in their latency/bandwidth
parameter which fundamentally changes the way in which
protocols need to operate. These problems are still at the
[...] do not discuss them further here. See
[...] for an introduction.

2. [...] Subsystems

The video and audio [...]
prototypes originally developed in 1991 by [...]
explore operation from a local disk. We have found that operation
over a network [...]
these subsystems. As an example, one such requirement [...]
in the data transfer process to [...]
system it was necessary to pass whole frames of video
to the compression [...]
frames. This causes problems [...]
transport was [...]
around 4 KBytes [...]
which after encapsulation [...]
the maximum [...]
the frame [...] fragmented into a [...]
transmitted back to back on the network. These packets [...]
then reassembled into the [...]
network subsystem before the whole datagram [...]
user buffer on the [...]

In order to reduce this bursty transmission of [...]
frames, which can lead to frame loss in the receiving [...],
it would be useful to fragment the original video frame into a
number of datagrams. Each datagram would correspond to a
single network packet and the individual packet [...]
onto the network could then be controlled by the server.
This means that at the client at some point these individual packets
must be reassembled into the original frames. The current
interface between the host and the compression subsystem
forces this to be done on the host before the data transfer
to the compression cards. Given the load on the host
machine processor it would be useful to explore whether this
reassembly could be better done on the compression subsystem
control processor.
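
The application-level fragmentation proposed above could look roughly like the sketch below. Each video frame is cut into datagrams small enough to fit one network packet, and each datagram carries a small header so the client can rebuild the frame. The 8-byte header layout, the 1400-byte payload size, and the function names are assumptions for illustration only. The sketch also assumes that `reassemble` is given datagrams from a single frame.

```python
import struct
from typing import Dict, List, Optional

# Hypothetical header: frame id (4 bytes), fragment index and fragment
# count (2 bytes each), all in network byte order.
HEADER = struct.Struct("!IHH")


def fragment(frame_id: int, frame: bytes,
             payload_size: int = 1400) -> List[bytes]:
    """Split one video frame into datagrams of at most one packet each."""
    chunks = [frame[i:i + payload_size]
              for i in range(0, len(frame), payload_size)]
    return [HEADER.pack(frame_id, index, len(chunks)) + chunk
            for index, chunk in enumerate(chunks)]


def reassemble(datagrams: List[bytes]) -> Optional[bytes]:
    """Rebuild a frame, or return None if any fragment is missing."""
    parts: Dict[int, bytes] = {}
    count = 0
    for datagram in datagrams:
        _frame_id, index, count = HEADER.unpack_from(datagram)
        parts[index] = datagram[HEADER.size:]
    if count == 0 or len(parts) != count:
        return None  # incomplete frame: drop it and move on to the next
    return b"".join(parts[i] for i in range(count))


# A frame of about 4 KBytes, the size mentioned in the text.
frame = bytes(range(256)) * 16
datagrams = fragment(frame_id=7, frame=frame)
```

Because the server now emits each datagram itself, it can space them out in time rather than letting the network subsystem send the fragments back to back. A client that misses one fragment can discard that frame quickly instead of stalling.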

Another area which has not been explicitly addressed so
far is the synchronization of source and sink clock rates and
the synchronization of separate video and audio streams. The
source and sink sample clock rate problem can be tackled by
modifying the effective display time of a sample or frame to
compensate for the differing sample rates [18]. One possible
mechanism for doing this is to monitor the receive FIFO depth
and use this to modify the playout time of buffered frames
or samples, in a manner similar to a phase locked loop, or
indeed a similar loop could be used to control the playout
clock frequency itself. Alternatively end system clocks could
be synchronized by the use of NTP (Network Time Protocol) 
[19], though there are obviously cases where this might be 
impractical, such as between separate organizations. 
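
A minimal sketch of the FIFO-depth mechanism, assuming a simple proportional control rule, is given below. The gain, the clamp, and the function name are illustrative choices. The text only says the loop behaves "in a manner similar to a phase locked loop" and does not specify a control law.

```python
def playout_interval(nominal_s: float, fifo_depth: int, target_depth: int,
                     gain: float = 0.01, max_adjust: float = 0.05) -> float:
    """Return the display time for the next frame, in seconds.

    A FIFO deeper than the target means the source clock runs faster than
    the sink clock, so frames are shown for slightly less time. A shallow
    FIFO has the opposite effect.
    """
    error = fifo_depth - target_depth
    # Clamp the correction so playback speed never changes noticeably.
    adjust = max(-max_adjust, min(max_adjust, gain * error))
    return nominal_s * (1.0 - adjust)


NOMINAL = 1 / 20  # 20 frames per second, as used in the experiments
at_target = playout_interval(NOMINAL, fifo_depth=8, target_depth=8)
draining_fast = playout_interval(NOMINAL, fifo_depth=10, target_depth=8)
running_dry = playout_interval(NOMINAL, fifo_depth=0, target_depth=10)
```

With these assumed parameters the correction is limited to 5 percent of the nominal frame time, so `running_dry` stretches each frame from 50 ms to about 52.5 ms.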

As noted earlier, the video and audio in this system are 
not linked by timestamps or other means, so either relative
sample clock drift or data loss on either channel will cause
the synchronization to slip. Many of these issues of terminal
synchronization and stream synchronization are being considered
as part of RTP, the real-time transport protocol, which is
currently an Internet Draft document and the reader is referred
there for further discussion [20].

A final point concerns the compression algorithms and 
media quality used in this system. As has been noted earlier 
the video was compressed using the JPEG algorithm which 
was originally developed for compression of full color still 
images [21]. The development of single chip implementations
of this algorithm has enabled it to be used at rates up to
H) frames/second, and has led to its use as a video coding 
algorithm [22]. The disadvantage of this algorithm compared 
to MPEG or H.261 [23] is that it makes no use of the temporal 
redundancy of successive frames. This makes the algorithm 
considerably less efficient. Typical figures suggest that MPEG 
gains a factor of 3 in the overall compression of many image 
sequences for an equivalent image quality [4].

The video sequences used in our experiments were displayed
as 1/4 screen VGA at 20 fps in full color. At the bit
rate used the image quality was slightly worse than VHS video
when viewed at typical computer screen distances. Whether 
this image quality is adequate depends very much on the 
application, but certainly for some applications mentioned, 
such as general purpose training, it would be sufficient. One 
area that needs further research is the ability to develop image 
quality metrics that can be related to applications requirements. 

The audio quality could probably best be described as high
end AM radio quality, though we did run the audio in stereo
which gave the impression that it was closer to FM radio
quality even though the sample rate used was only 9.45 kHz. 
Again it is difficult to relate this to applications requirements, 
but it was perfectly adequate to replay the film soundtrack. 



3. Disk Technology 

Disk technology is one area of concern if the video jukebox 
application, or related applications such as video mail, are to 
become viable in the near future. Parallel access to multiple 
stored video streams, or multiple random access to a single 
stream incurs significant overheads as the read heads track 
from one section of the disk to another. This greatly reduces 
the maximum possible rate that can be sustained off the disk, 
as was shown in Section III. 

This problem of disk rates is not however exclusive to 
multimedia applications. Over the last decade the processing 
power of single chip CPUs has grown at 50-60% per year 
while dynamic disk performance has merely doubled. As a 
result there is considerable activity throughout the industry 
aimed at improving disk performance. Many companies are 
considering multiple disk array subsystems, or developing 
disks with multiple active read heads. There is also the 
possibility of low cost solid state FLASH disks appearing 
over the next few years. This implies it should be possible to 
develop relatively low cost video servers capable of delivering 
multiple streams during the mid to late 1990's. 
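
To make the size of this gap concrete, the short calculation below compounds the quoted CPU growth rates over ten years and compares them with a doubling in disk performance. Taking "the last decade" as exactly ten years is an assumption of this example.

```python
YEARS = 10  # assumed length of "the last decade"

cpu_low = 1.5 ** YEARS    # 50% growth per year, compounded
cpu_high = 1.6 ** YEARS   # 60% growth per year, compounded
disk = 2.0                # dynamic disk performance "merely doubled"

gap_low = cpu_low / disk
gap_high = cpu_high / disk

print(f"CPU growth over {YEARS} years: {cpu_low:.0f}x to {cpu_high:.0f}x")
print(f"CPU growth relative to disk: {gap_low:.0f}x to {gap_high:.0f}x")
```

On these figures processing power grew by roughly 58 to 110 times over the decade, so the disk fell behind the processor by a factor of about 29 to 55.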



VII. Conclusions
The work reported in this paper has shown that it is
possible to build networked video and audio systems using 
standard equipment that is available off-the-shelf. As expected 
there are limitations in this approach. These are in the data 
rates and the number of users that can be supported, and 
the classes of application that can be provided. However, 
within these limits, video compression technology has given 
us the capability to provide some networked multimedia 
services now. The experiments conducted have highlighted 
the areas where further progress is required, either to support 
applications with real-time interaction or to provide higher data 
rates. We conclude by stating these areas among our general 
observations below. 

• It is not necessary to use an isochronous network to carry 
networked video and audio. Conventional packet-based 
networks can be used to support many multimedia applications,
although some work is required in the protocols
used, and the precise implementation of these protocols. 
In some ways, these networks are better suited to the 
application, particularly when the data streams are bursty, 
as with compressed video for example. 

• The system characteristics required to replay stored video 
and audio, particularly end-to-end delay and delay jitter, 
are much less stringent than those required for real- 
time full duplex video conferencing. The best opportunity 
in the near term for developing multimedia products 
integrated with conventional computer systems is with 
applications dependent on the replay of stored video, not 
those requiring real-time full duplex conferencing. 

• There are many system design issues that arise when com- 
pressed isochronous data is brought into a conventional 
computer system. These problems become much more 
acute if the system is networked. 

• In the future, we see disk throughput as being a greater
limitation than network bandwidth. Thus, it will be nec- 
essary to provide specialized disks, or disk arrays, for 
a server to support a large number of data channels 
simultaneously. 

• It is a requirement to provide real-time transport for multi- 
media applications. However, the overhead of protocol 
processing is not significant. It is not necessary to pro- 
vide custom hardware for this purpose. Applications that 
require multicast may prove more of a problem in this 
respect. 

• The reduction in data rates achieved just through the use 
of JPEG hardware means that bandwidth requirements of 
a limited number of users of a real-time system such as the 
video jukebox can be met by existing networks. Of greater 
importance than the total bandwidth in the network are the
mechanisms provided for resource allocation.

Abstract

We present the design and implementation of the "Ninja 
Jukebox", an infrastructural service that allows a community
of users to build a distributed, collaborative
music repository that delivers digital music to Internet 
clients, and that performs simple collaborative filtering 
based on users' song preferences inferred by the service. 
The Jukebox, implemented in Java, was designed to al- 
low rapid service evolution and reconfiguration, simplic- 
ity in participation, and extensibility. We demonstrate 
that our careful use of a distributed component archi- 
tecture enabled rapid prototyping of the service, and 
that our use of carefully designed, strongly typed inter- 
faces enabled the smooth evolution of the service from 
a simple prototype to a more complex, mature system. 



1 Motivation 

The Internet is evolving towards a service in- 
frastructure: a network of rich, robust, and often 
professionally maintained services that are conve- 
niently accessible to people through the web. How- 
ever, the fact that these services rely on the web to 
present their content effectively restricts their users 
to be human: the lack of structure and well-defined 
types in web content makes it all but impossible for 
computer programs to interact with most Internet 
services, despite the obvious benefits of being able 
to do so (such as service composition, richer search 
and information access services, etc.). Several re- 
cent efforts have attempted to introduce such struc- 
ture and typing to the web, such as the WIDL [12] 
and WebL projects [16], or the ongoing W3C XML 
developments [6]. 

The UC Berkeley Ninja project 1 is pursuing a 
complementary path to these efforts: we are build- 
ing an infrastructure for supporting fault toler- 
ant, highly available, scalable services composed of 
a number of well-circumscribed components, each 
of which exports a strongly-typed, programming- 

1 Project home page: http://ninja.cs.berkeley.edu



language level interface [10] accessible using RPC- 
like mechanisms [4]. Explicitly exposing service in- 
terfaces and making use of strong typing has a num- 
ber of benefits, including forcing authors to carefully 
design the boundary between their services and the 
rest of the world, making those services accessible 
to programs, and allowing the composition of ser- 
vices by infrastructural elements. We believe that 
when a large number of such services are deployed, 
a network-externality effect will occur, causing the 
power of an individual service to be greatly en- 
hanced by interaction with the many other available 
services. 

In this paper, we describe one such service: the 
Ninja Jukebox. The Jukebox allows a community 
of users to collaboratively build a distributed mu- 
sic repository out of both music CDs and MP3 files 
stored in local filesystems, and to use simple collabo- 
rative filtering to allow individual users to filter their 
music preferences according to other community 
members' explicit and implicit recommendations. 
In section 2, we discuss the design rationale that 
went into the Ninja Jukebox, and reflect on how the 
Ninja project's service philosophy influenced this 
design. Section 3 describes our Java-based Juke- 
box implementation and how we smoothly evolved 
it, and section 4 presents some of the limitations
of our implementation and the lessons we learned 
while building it. Finally, in section 5, we present 
related work. 



2 Design Philosophy 

The Ninja Jukebox application was originally 
conceived of to "scratch the itch" of several gradu- 
ate students: to be able to harness the large number 
of unused CD-ROM drives in the 100+ node Berke- 
ley network of workstations (NOW) [3] and present 
a single, unified view of all music in all drives. Over 
time, the Jukebox has vastly evolved in complex- 
ity and richness. It now transparently supports 
both raw audio CDs in CD-ROM drives and MP3 
files in local filesystems, and it performs authentication
and access control in order to adhere to copyright
laws. It exports both a programmatic interface
and an HTML interface for backwards compatibility 
with browsers; its programmatic interface includes 
a collaborative filtering service that deduces users' 
song preferences, and allows one to construct song 
playlists based on simple boolean combinations of 
other users' preferences. 
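
As an illustration of building a playlist from boolean combinations of other users' preferences, the sketch below uses plain Python sets. The user names, song identifiers, and the `liked_by` helper are invented for the example and do not reflect the Jukebox's actual programmatic interface.

```python
from typing import Dict, Set

# Hypothetical preference data: user name -> songs that user likes.
preferences: Dict[str, Set[str]] = {
    "alice": {"song-a", "song-b", "song-c"},
    "bob": {"song-b", "song-c", "song-d"},
    "carol": {"song-c", "song-e"},
}


def liked_by(user: str) -> Set[str]:
    """Return the set of songs a user likes (empty if unknown)."""
    return preferences.get(user, set())


# "Songs that alice AND bob like, but NOT carol."
playlist = sorted((liked_by("alice") & liked_by("bob")) - liked_by("carol"))
```

Set intersection, union, and difference map directly onto AND, OR, and AND NOT, which is one simple way to express the combinations described above.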

The Ninja Jukebox was designed with several
specific goals in mind. The first goal was that the 
Jukebox should be a communal, collaborative ser- 
vice. Individuals should be able to add or retract 
their personal collection of music from the Jukebox 
as they please, without requiring special interven- 
tion from a centralized administrator. This implies 
that contributors should be given as flexible as pos- 
sible of a "service contract" — they must be allowed 
to retain control over their own contributions, while 
still ensuring that the overall Ninja Jukebox ser- 
vice maintains as stable as possible of a view to 
the rest of its users. The Jukebox service therefore
must be able to adapt to changing group member- 
ship by gracefully masking unpredicted failures or 
disappearances. 

Another goal was for the Jukebox service to re- 
tain flexibility, extensibility, and the facility for 
rapid evolution. As the evolution of the Jukebox 
has explicitly demonstrated, applications are not 
cast in stone, and services should not remain im- 
mutable once they have been released and are in use 
by applications and users. We therefore wanted our 
infrastructure to admit the evolution of its services,
and we wanted to design the Jukebox service in such 
a way as to most easily allow it to be extended in 
unforeseen ways. 

2.1 Design Implications 

In order to meet the above goals, we made the 
following three explicit design decisions: the adop- 
tion of a distributed component architecture to de- 
compose the Jukebox service into a small number 
of carefully chosen, functionally decoupled pieces, 
the imposition of a rich, strongly typed interface on 
these components (including carefully chosen data 
structures that precisely describe the contents of the 
Jukebox), and the use of soft state to achieve even- 
tual consistency in the Jukebox. 

Disciplined use of a distributed component 
architecture: as exemplified by Sun's Jini [21] and 
Corba [20], distributed component architectures ad- 
vocate the use of an object-oriented language to 
decompose applications into smaller, self-contained 
objects, and the distribution of those objects across 



machine boundaries, relying on mechanisms such as 
RPC to perform inter-object communication. Com- 
ponent architectures make it simpler to begin with 
and maintain a clean design throughout the service's 
lifetime: the separation into objects allows for a sep- 
aration of concerns, a tenet of good software engi- 
neering. 

In the Ninja Jukebox, we decomposed our ser- 
vice into three major components, each respectively 
responsible for: (1) managing local collections of 
music (independent of physical and logical format), 
(2) the integration of many such collections of music 
and the maintenance of metadata about the music 
(such as users' song preferences), and (3) the client- 
side retrieval and playing of music from the service. 
This deliberate decomposition is what ultimately 
allowed the Jukebox to evolve so painlessly — each 
component's functionality is well encapsulated and 
isolated from other components, meaning that these 
components can internally evolve without affecting 
the rest of the system, and that new components 
can be added that compose with existing pieces to 
enhance the overall service. For example, the com- 
ponent responsible for managing local collections of 
music encapsulates information and access mecha- 
nisms particular to a music format, and thus the 
transition from supporting only audio CDs in CD-ROM
drives to also supporting MP3 files stored in a
file system merely involved introducing a subclass 
of that component. Similarly, we could envision 
adding subsequent subclasses that would contain all 
music available from popular music web sites (such 
as http://www.mp3.com), or would serve as a gate- 
way to receive music broadcasts (such as MBONE 
vat sessions). 
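
The role of subclassing in this decomposition can be sketched as below. The Jukebox itself is written in Java. This Python outline only mirrors the idea, and every class and method name here is hypothetical rather than taken from the Ninja code. The in-memory dictionaries stand in for a real CD-ROM drive and a real file system.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class SoundSmith(ABC):
    """Format-independent view of one local music collection."""

    @abstractmethod
    def list_songs(self) -> List[str]:
        """Return the titles this collection can serve."""

    @abstractmethod
    def open_stream(self, title: str) -> bytes:
        """Return the audio data for one title."""


class CdSoundSmith(SoundSmith):
    """Stand-in for a component that reads tracks from an audio CD."""

    def __init__(self, tracks: Dict[str, bytes]) -> None:
        self._tracks = tracks

    def list_songs(self) -> List[str]:
        return sorted(self._tracks)

    def open_stream(self, title: str) -> bytes:
        return self._tracks[title]


class Mp3SoundSmith(SoundSmith):
    """Stand-in for a component that serves MP3 files from a directory."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files  # keys are file names such as "tune.mp3"

    def list_songs(self) -> List[str]:
        return sorted(n[:-4] for n in self._files if n.endswith(".mp3"))

    def open_stream(self, title: str) -> bytes:
        return self._files[title + ".mp3"]


def catalogue(smiths: Iterable[SoundSmith]) -> List[str]:
    """Code written against the base class works for every format."""
    return sorted(song for smith in smiths for song in smith.list_songs())


songs = catalogue([
    CdSoundSmith({"Track 1": b"pcm"}),
    Mp3SoundSmith({"tune.mp3": b"mp3", "notes.txt": b""}),
])
```

Supporting a further source, such as a gateway to a music web site, would mean adding one more subclass while `catalogue` and its callers stay unchanged.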

Strongly-typed interfaces: in our opinion, the 
use of a distributed component architecture is only a 
partial step towards a properly decomposed service: 
the careful design of the interfaces between those 
components is a second, crucial step. An inter- 
face to a component is a declaration of both syntax 
and semantics, and as such is a contract that binds 
the component author to maintain those semantics 
even when the component is enhanced or extended 
through subclassing. Furthermore, the API to the 
service ultimately dictates the expressive power that 
clients of that service have available to them. We 
believe that an infrastructure service is defined by 
its interface and a declared set of guarantees about 
its performance and availability. 

In the Jukebox, our APIs include data struc- 
tures that richly describe content. These structures 
enable intelligent applications such as clients that 
group music on arbitrary terms, or that allow users
to construct playlists based on either explicit declarations
or inferred preferences gleaned from the
service's observation of their listening history. This 
focus on strongly-typed interfaces helps remove bar- 
riers to rapid service evolution by forcing service au- 
thors to carefully design and explicitly declare each 
of their components' interfaces, and therefore their 
implied service contracts. 

Use of soft state to achieve eventual con- 
sistency: as a side-effect of making the Jukebox 
collaborative, we could not rely on any particular 
person's contributions to remain available. We thus 
designed the infrastructure so that a contributor pe- 
riodically announces the presence of his/her music 
to a common master repository in order to add mu- 
sic to the overall Jukebox. The act of a person 
adding music to the Jukebox is therefore treated 
as a hint rather than a promise: components can- 
not rely on that music being there, and they must 
gracefully handle the case in which a particular song 
abruptly becomes unavailable. We also treat entries 
in the master repository as a lease, and expire them 
if the periodic announcements stop. The master 
repository correspondingly contains an approximate 
view of all available music: this view continually ap- 
proaches the correct view over time. This leased 
approach is also used in our authentication mecha- 
nisms: when a client requests a song from the Juke- 
box, it must first authenticate itself, the result of 
which is a capability that is good for a single use or 
for thirty seconds: the Jukebox components lazily 
time out these capabilities as necessary. 
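
The capability behaviour described above (valid for a single use or for thirty seconds, and timed out lazily) might be sketched as follows. The class name, the token format, and the explicit `now` parameter used to keep the example deterministic are all assumptions. The real Jukebox uses the Ninja project's Java authentication infrastructure, which is not shown here.

```python
import secrets
import time
from typing import Dict, Optional

LIFETIME_S = 30.0  # a capability is good for thirty seconds


class CapabilityTable:
    """Issues single-use capabilities and expires them lazily."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def issue(self, now: Optional[float] = None) -> str:
        """Called after a client has authenticated itself."""
        now = time.monotonic() if now is None else now
        token = secrets.token_hex(16)
        self._expiry[token] = now + LIFETIME_S
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> bool:
        """Return True if the token may be used to fetch one song."""
        now = time.monotonic() if now is None else now
        # pop() enforces single use. No background timer is needed: an
        # expired token is only noticed and discarded when it is looked up.
        expiry = self._expiry.pop(token, None)
        return expiry is not None and now <= expiry


table = CapabilityTable()
fresh = table.issue(now=0.0)
first_use = table.redeem(fresh, now=10.0)
second_use = table.redeem(fresh, now=11.0)
stale = table.issue(now=0.0)
late_use = table.redeem(stale, now=31.0)
```

In this sketch `first_use` succeeds, `second_use` fails because the capability was already consumed, and `late_use` fails because more than thirty seconds have passed.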

Not all state in the Jukebox is soft-state: users' 
song preferences, for example, are stored as hard 
state by dedicated, highly-available infrastructure in
what we call a "base" [10]. A base is composed of everything
needed to build an available compute cluster,
including system administration, a secure machine
room, redundant networks, UPS, etc., and as
such is an ideal environment in which to protect 
hard state. 



3 Implementation 

This section of the paper describes the imple- 
mentation of the Ninja Jukebox service and client, 
and their evolution through three stages of func- 
tionality. The first version of the service only sup- 
ported the playback of raw audio CDs from the CD- 
ROM drives of Jukebox servers. In the second ver- 
sion of the service, we added the ability to convert 
raw CDs into compressed MP3 files, and for those 
MP3 files to be played over the network; this sec- 



ond version also included authentication and access 
control mechanisms to enforce copyright protection. 
Finally, we added a simple collaborative filtering 
mechanism to the third and current version of the 
Jukebox. 

We chose to implement the Ninja Jukebox ser- 
vice in Java, both because Java trivially enables dis- 
tributed components and because the Ninja project 
has developed a significant amount of infrastructure 
in Java. This infrastructure includes authenticated 
remote method invocation (RMI) and a cluster- 
based service platform called "MultiSpace" [10] that 
was designed to support scalable and rapidly evolv- 
able infrastructure services. 

3.1 Ninja Jukebox v1.0: raw audio CD
playback 

As shown in Figure 1, the first version of the 
Ninja Jukebox implementation was decomposed 
into the following elements: 

The SoundSmith: SoundSmiths are responsi- 
ble for indexing and maintaining a structured collec- 
tion of music. The version 1.0 SoundSmith indexes 
music on an audio CD in a local CD-ROM drive, 
making use of a service that acts as an HTTP to 
RMI gateway to provide programmatic access to an 
online "CDDB" database[15]. This database pro- 
vides a mapping from a CD's track timing infor- 
mation to detailed information about the CD's au- 
thor, song titles, and song durations. SoundSmiths 
periodically send beacons to the MusicDirectory; 
through these beacons, they both announce their 
existence to the MusicDirectory and present the list 
of songs that they maintain. Anyone that wishes 
to contribute music to the Jukebox must only run 
a SoundSmith that can index and serve that music.
SoundSmiths can be started up and torn down
at any time, as each SoundSmith is completely au- 
tonomous, and the beacons emitted by the Sound- 
Smith are treated as soft-state by the MusicDirec- 
tory. A SoundSmith serves music by streaming it off 
of an audio CD from the CD-ROM drive of an infras- 
tructure workstation and transmitting it in uncom- 
pressed . au format through an (untyped) HTTP in- 
terface. 

The MusicDirectory: As previously mentioned, SoundSmiths
periodically beacon their existence and a list of their
music to the MusicDirectory. The role of the MusicDirectory is to keep
track of these beacons, and to build up an integrated 
list of all available music and of all running Sound- 
Smiths. Clients use the MusicDirectory as a level of 
indirection that shields them from needing to independently
discover the location of all SoundSmiths in the Jukebox.
Ultimately, this centralized MusicDirectory limits the scale
of a Jukebox, since all SoundSmiths repeatedly send it
listings of music.

Figure 1: The Ninja Jukebox v1.0 architecture
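
As a rough illustration of the soft-state beaconing described above, the following Java sketch shows a directory whose entries expire unless they are refreshed. The class name SoftStateDirectory, the time-to-live parameter, and the method names are assumptions made for this example; the text does not give the MusicDirectory's actual interface or its beacon period.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical soft-state directory: entries vanish unless refreshed by beacons. */
class SoftStateDirectory {
    private final long ttlMillis;  // assumed lifetime of a beacon
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, List<String>> songs = new HashMap<>();

    SoftStateDirectory(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Record a beacon from a SoundSmith (v1.0 style: the beacon carries the song list). */
    synchronized void beacon(String soundSmithUrl, List<String> songList, long nowMillis) {
        lastSeen.put(soundSmithUrl, nowMillis);
        songs.put(soundSmithUrl, new ArrayList<>(songList));
    }

    /** Drop SoundSmiths whose last beacon is older than the TTL, then list what remains. */
    synchronized List<String> availableSongs(long nowMillis) {
        lastSeen.entrySet().removeIf(e -> nowMillis - e.getValue() > ttlMillis);
        songs.keySet().retainAll(lastSeen.keySet());
        List<String> all = new ArrayList<>();
        for (List<String> list : songs.values()) {
            all.addAll(list);
        }
        return all;
    }
}
```

Because nothing is ever explicitly deregistered, a SoundSmith that is torn down simply stops beaconing and disappears from the listing once its entry times out.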

Jukebox Clients: Jukebox Clients interact 
with a MusicDirectory to gather a listing of available 
music, and with many SoundSmiths to receive and 
play specific songs. We have currently implemented 
two clients. The first presents a graphical user inter- 
face to the user (figure 2), and allows users to build 
playlists of available songs. Music streamed to this 
client is shuttled to external music players that un- 
derstand many music formats and have the ability 
to play music as it is streamed over the network. In- 
ternally, this client is decomposed into a GUI front 
end and a song selection back end. The GUI front 
end provides the user with controls for constructing 
playlists, and with familiar play, stop, pause, fast- 
forward, and reverse buttons. The song selection 
back end selects specific songs to play given the list 
of currently available music from the MusicDirec- 
tory, the user's manually constructed playlist, and 
events that are generated when the buttons such 
as play or stop are pressed. The second client is 
a proxy that converts between the APIs and data 
structures exported by the Ninja Jukebox service 
and HTML forms. This proxy allows conventional 
HTML browsers to access the Jukebox: music is 
streamed through the proxy to the browser, or pre- 
sumably to the browser's helper applications that 
can actually understand specific audio formats. 

This first version of the Jukebox service was well 
received even though it suffered from a number of 
drawbacks. The fact that all audio was transmitted in an
uncompressed format resulted in excessive traffic on our
local networks, greatly limiting the number of clients that
could simultaneously access the Jukebox. Furthermore, the
fact that music could only be served from audio CDs
physically present in CD-ROM drives limited the amount of
music that could be present in the Jukebox at any given
time, since we had a limited (although large) number of
CD-ROM drives at our disposal. Finally, the lack of any
security infrastructure prevented us from widely releasing
the Jukebox service and client, even within our department,
since it would become trivial for users to violate
copyright protection legislation, either accidentally or
deliberately.

Figure 2: The Ninja Jukebox client GUI

3.2 Ninja Jukebox v2.0: MP3 playback and security

The separation of the Jukebox into the previously 
described components satisfied our primary design 
goal: to construct a collaborative service, in which 
anyone can contribute their collection of music to 
the Jukebox. A second design goal was to allow for 
the evolution of the service; in order to test this goal 
(and to satiate the demands of the clients of the v1.0
Jukebox), we extended the Jukebox functionality to
produce the v2.0 version of the service. This version
of the service attempted to overcome the drawbacks
of the v1.0 prototype by including two new major
features: the transparent support of MP3 files, and 
support for access control and client authentication. 

We also slightly modified the Jukebox by hav- 
ing SoundSmiths only report their existence to the 
MusicDirectory rather than the complete list of mu- 
sic that they manage; in the v2.0 infrastructure, 
clients discover SoundSmiths through the MusicDi- 
rectory, but then ask each individual SoundSmith 
for its list of locally available music. This modifica- 
tion drastically reduced the size of the SoundSmith's 
beacons, which eased the scaling bottleneck caused 
by the centralized MusicDirectory. This bottleneck 
became increasingly evident as the body of music 
stored in the Jukebox grew to over 4,400 songs (375 
albums, accounting for more than 25 gigabytes of 
hard drive space and 320 hours of music). 

3.2.1 MP3 Support

MP3 support was surprisingly easy to add to the 
Jukebox service. To do it, we simply created a subclass
of the SoundSmith component that understood
how to index and stream MP3 files on a regular 
filesystem instead of audio tracks from an audio CD 
in a CD-ROM drive. The data structures embed- 
ded in the SoundSmith's beacons are only meta- 
data, and as such are totally independent of the 
specific format in which the music is actually kept. 
In order to play music, Jukebox clients interact with 
the MusicDirectory service to fetch an HTTP URL 
for a song; this URL is served by the SoundSmith 
that maintains the song. Because the song data 
is streamed to external music player software that 
happens to understand MP3 formatted music, the 
Jukebox clients never need to understand anything 
about the music format. When we deployed several 
of these MP3-aware SoundSmiths in our infrastruc- 
ture, Jukebox clients suddenly became aware of a 
much larger set of available music, and transpar- 
ently began accessing the newly available MP3 files. 
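
The subclassing approach described above might look roughly like the following Java sketch. The method names (listTitles, urlFor), the URL layouts, and the Catalog helper are hypothetical; the text does not show the real SoundSmith RMI interface, and in the real system clients obtain song URLs via the MusicDirectory rather than by calling SoundSmiths as directly as this example does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base component that serves tracks from an audio CD. */
class SoundSmith {
    protected final String host;
    protected final List<String> titles;

    SoundSmith(String host, List<String> titles) {
        this.host = host;
        this.titles = new ArrayList<>(titles);
    }

    /** Format-independent metadata, as carried in beacons. */
    List<String> listTitles() {
        return new ArrayList<>(titles);
    }

    /** URL an external player can stream from; here, raw .au audio read off the CD. */
    String urlFor(String title) {
        return "http://" + host + "/cd/track" + (titles.indexOf(title) + 1) + ".au";
    }
}

/** Specialization that serves MP3 files from a filesystem; the interface is unchanged. */
class Mp3SoundSmith extends SoundSmith {
    Mp3SoundSmith(String host, List<String> titles) {
        super(host, titles);
    }

    @Override
    String urlFor(String title) {
        return "http://" + host + "/mp3/" + title.replace(' ', '_') + ".mp3";
    }
}

/** Client-side code depends only on the SoundSmith type, never on the audio format. */
class Catalog {
    static List<String> allUrls(List<SoundSmith> smiths) {
        List<String> urls = new ArrayList<>();
        for (SoundSmith s : smiths) {
            for (String t : s.listTitles()) {
                urls.add(s.urlFor(t));
            }
        }
        return urls;
    }
}
```

Since Catalog sees only the base type, adding an MP3-serving SoundSmith to the list makes its songs appear without any change to client code, which is the kind of transparency the text describes.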

The MP3 files maintained by SoundSmiths are 
created by helper daemons that batch convert music 
CDs to MP3 formatted files by first "ripping" raw 
audio from the CD, and then compressing that raw 
audio into an MP3 file and its associated artist and 
album metadata (figure 3). These daemons run in 
the background on all of our Jukebox workstations, 
effectively crawling the Jukebox for new music to 
MP3 compress and add to the Jukebox. While this 
conversion is happening, an audio CD SoundSmith 
can serve the music directly off of the audio CD;
after the conversion is finished, the music can be
served in the preferable MP3 format.

Figure 3: MP3 Support in the Jukebox v2.0

We attribute the ease with which we added sup- 
port for MP3 files to the Jukebox infrastructure to 
our use of a distributed object infrastructure and 
to the strongly-typed interfaces between our Jukebox
components. The ability to subclass in order to
specialize the SoundSmith allowed us to maintain 
its RMI interface, and thus upgrade its functional- 
ity in a manner that was transparent to the rest 
of the Jukebox. Transparency was meaningful be- 
cause of the presence of explicit interfaces between 
the components; achieving transparency in this case 
was a matter of maintaining both the syntax and
declared semantics of the interface. 

3.2.2 Security Infrastructure 

For the Jukebox, the only relevant security issues 
are access control and authentication. Our authentication
mechanism is based on SecureRMI, a variant of RMI (Java's
standard remote method invocation protocol [17]) that we
have developed to operate over a
cryptographically-secured channel. With this
tool in place, the access control problem becomes 
relatively easy: for each song in a SoundSmith, that 
SoundSmith maintains an ACL (a list of Secure- 
RMI principals allowed to play that song). The ac- 
cess control mechanism thus is as simple as having 
the SoundSmith look up an entry in a list. The 
SoundSmith also hands out capabilities to authen- 
ticated principals that allow them to access specific 
songs for a limited amount of time: these capabil- 
ities are good for a single access, and expire if not 
used within 30 seconds. Note that the MusicDirec- 
tory does not need to authenticate the identity of 
clients, as it is entrusted only with a list of available 
songs and SoundSmiths, and not the songs' con- 
tent. For the proxied HTML-based client to work, 
however, the proxy itself must be entrusted with 
its users' credentials, since HTML browsers do not
have the ability to interact with our SecureRMI
infrastructure directly.
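
The per-song ACL check and the short-lived, single-use capabilities described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class SongAccessControl, its method names, and the random-token representation of a capability are invented for the example, and principals are plain strings rather than SecureRMI principals. Only the 30-second lifetime and the single-use rule come from the text.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-SoundSmith access control with expiring, single-use capabilities. */
class SongAccessControl {
    static final long CAPABILITY_LIFETIME_MILLIS = 30_000;  // 30 seconds, as in the text

    private final Map<String, Set<String>> acl = new HashMap<>();  // song -> allowed principals
    private final Map<String, Long> expiry = new HashMap<>();      // token -> expiry time
    private final Map<String, String> tokenSong = new HashMap<>(); // token -> song it unlocks
    private final SecureRandom random = new SecureRandom();

    /** Add a principal to the ACL of a song. */
    void allow(String song, String principal) {
        acl.computeIfAbsent(song, s -> new HashSet<>()).add(principal);
    }

    /** Returns a fresh capability token, or null if the principal is not on the song's ACL. */
    String issue(String song, String principal, long nowMillis) {
        Set<String> allowed = acl.get(song);
        if (allowed == null || !allowed.contains(principal)) {
            return null;
        }
        String token = Long.toHexString(random.nextLong()) + Long.toHexString(random.nextLong());
        expiry.put(token, nowMillis + CAPABILITY_LIFETIME_MILLIS);
        tokenSong.put(token, song);
        return token;
    }

    /** Redeems a token: valid once, for the named song only, and only before it expires. */
    boolean redeem(String token, String song, long nowMillis) {
        Long deadline = expiry.remove(token);       // removal makes the token single-use
        String boundSong = tokenSong.remove(token);
        return deadline != null && nowMillis <= deadline && song.equals(boundSong);
    }
}
```

In this sketch the access control decision is a set lookup made when the capability is issued; redeeming the capability needs no further authentication, which is why it can be presented over a plain HTTP request for the song.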

Currently, our policy for access control is rel- 
atively simplistic: a principal can only listen to 
a copyright-protected song if she has previously 
demonstrated knowledge of the song contents (e.g. 
by uploading it to the Jukebox); unrestricted access 
is given to music marked as non-copyrighted. This 
approach is inspired by legal considerations: if peo- 
ple can't abuse the Jukebox to gain access to mu- 
sic they don't already have, it seems unlikely that 
the Jukebox will be accused of violating copyright 
laws. However, the Jukebox could also accommo- 
date more sophisticated policies for access control, 
such as support for group ownership where only one 
group member can listen to a song at a time, or 
a pay-per-use scenario under which royalties could 
be collected and submitted to the copyright holders. 
The flexibility of our design makes such variations 
on authentication quite straightforward. 

Returning to the authentication mechanism, SecureRMI
was the piece of the security architecture
that demanded the vast majority of our security en- 
gineering effort. SecureRMI (optionally) authenti- 
cates the endpoints and then encrypts the remain- 
der of the communication with a Triple-DES session 
key derived from a Diffie-Hellman key exchange. We 
also provide a certification infrastructure for end- 
point public keys and tools for managing them; cer- 
tificates bind the service's fully-qualified class name 
(or the client's identity) to the server's (or client's) 
public key. 

Of course, there is nothing new about the con- 
cept of establishing a secure channel with the use of 
encryption [18, 22]. However, we feel that our
implementation may be of interest primarily because
it exists: we are not aware of any other free, Java 
implementation with similar functionality. 2 

One novel feature of our SecureRMI is that it pro- 
vides transparent support for a very broad range of 
"authentication" technologies. We have abstracted 
away many of the irrelevant details of the algorithms to
build a very general model of authentication. For instance,
public-key authentication is implemented in
DSAAuthenticator and DSAVerifier, which are subclasses of
the generic Authenticator and Verifier classes; SecureRMI
only references the generic Authenticator/Verifier
superclasses, so it is ignorant of the details of their
implementation. This architecture is very flexible: after
the core infrastructure was in place, we later added a
symmetric-key challenge-response protocol with about two
days of coding.

2 JDK 1.2 includes hooks so you can encrypt RMI
communications with SSL, if you have an SSL library; but we
do not know of any free SSL implementations for Java [8].

As a result, extending the Jukebox to support 
pay-per-use access will require only minimal effort. 
We would just add a PayPerUseAuthenticator 
that, instead of sending a public-key signature for 
authentication, sends a digital coin. This is a di- 
rect result of our design goal that services be easy 
to evolve and extend. 
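
To make the generic Authenticator/Verifier split more concrete, here is a minimal Java sketch of a symmetric-key challenge-response pair plugged in behind the two superclasses. Only the names Authenticator and Verifier come from the text; the method signatures, the SharedKey subclasses, the Handshake helper, and the use of HMAC-SHA256 are assumptions for this example and are not the SecureRMI implementation.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Generic prover side; the channel code would only ever reference this type. */
abstract class Authenticator {
    abstract byte[] respond(byte[] challenge);
}

/** Generic checker side. */
abstract class Verifier {
    abstract boolean verify(byte[] challenge, byte[] response);
}

/** Hypothetical symmetric-key authenticator: answers a challenge with an HMAC over it. */
class SharedKeyAuthenticator extends Authenticator {
    private final byte[] key;

    SharedKeyAuthenticator(byte[] key) {
        this.key = key.clone();
    }

    @Override
    byte[] respond(byte[] challenge) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(challenge);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}

/** Hypothetical verifier: recomputes the expected HMAC and compares in constant time. */
class SharedKeyVerifier extends Verifier {
    private final SharedKeyAuthenticator expected;

    SharedKeyVerifier(byte[] key) {
        this.expected = new SharedKeyAuthenticator(key);
    }

    @Override
    boolean verify(byte[] challenge, byte[] response) {
        return MessageDigest.isEqual(expected.respond(challenge), response);
    }
}

/** Stand-in for the channel code, written only against the generic superclasses. */
class Handshake {
    static boolean authenticate(Authenticator prover, Verifier checker, byte[] challenge) {
        return checker.verify(challenge, prover.respond(challenge));
    }
}
```

Under this structure, a pay-per-use scheme such as the PayPerUseAuthenticator mentioned above would be one more Authenticator/Verifier pair, and Handshake would not need to change.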

Our general model of authentication also allows 
each collaborator to specify her own access-control 
policies for the music she serves; one SoundSmith 
could be serving music on an ACL basis, another 
could be serving only free music, but only to hosts in 
a certain domain, and others could be charging vari- 
ous amounts to listen to the audio stream. The flex- 
ibility provided by this mechanism further enhances 
the communal, collaborative nature of the Jukebox, 
by removing access-control policy from any central 
authority. 

3.3 Ninja Jukebox v3.0: the collaborative DJ service

Most recently, we have extended the Jukebox ser- 
vice to provide song selection based on inferred user 
preferences as well as some simple collaborative
filtering functionality. In the v1.0 and v2.0 Jukebox
services, song playlists are manually constructed by 
users and successive songs to be played are cho- 
sen from these playlists by simple random selection. 
Our collaborative filtering extension refines this se- 
lection with an infrastructural "DJ" service that ex- 
ploits individual and collaborative song preferences. 

A key observation is that an infrastructure ser- 
vice may, over time, learn user song preferences by 
observing UI events. Songs that the user always
"fast-forwards" past are probably songs the user 
doesn't like; in contrast, listening to a song until 
its completion may be an indication that the user 
enjoyed the song. This observation forms the basis 
for our preference inference mechanism. As we de- 
scribed in section 3.1, our graphical Jukebox client 
is decomposed into a GUI front end and a song se- 
lection back end. In our v3.0 Jukebox infrastruc- 
ture, we have decoupled this song selection from the 
client executable, moving it instead into the network
as an infrastructure service so that our song
selection algorithms may be upgraded and evolved
transparently without modifying code on the client
side. This enabled us to extend the original unintelligent
song selection algorithm by interposing on
the selection interface. 

In our DJ implementation, a rating storage service in the
infrastructure subscribes to client UI events; every time a
user presses a button such as "fast-forward", a SecureRMI
call is made into this DJ service to report the event. The
DJ interprets these events as implicit hints about the
user's song preferences, and updates a persistent database 3
on disk to reflect the new information about the user. Our
prototype also allows users to explicitly specify their
preferences about individual songs, if they like. Still,
the advantage of transparent preference inference is that
it requires no extra action on the part of the user.

Figure 4: The DJ collaborative filtering client GUI element
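
The implicit preference inference described above could be sketched as a small scoring table, as in the Java example below. The PreferenceStore class, the two event types, and the +1/-1 weights are assumptions chosen for illustration; the text does not say how the DJ actually weighs events, and the real service stores its data persistently rather than in memory.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the DJ's rating storage. */
class PreferenceStore {
    enum UiEvent { FAST_FORWARD, PLAYED_TO_END }

    // user -> (song -> accumulated score)
    private final Map<String, Map<String, Integer>> scores = new HashMap<>();

    /** Interpret a UI event as an implicit hint: skipping counts against a song. */
    void report(String user, String song, UiEvent event) {
        int delta = (event == UiEvent.PLAYED_TO_END) ? 1 : -1;  // assumed weights
        scores.computeIfAbsent(user, u -> new HashMap<>()).merge(song, delta, Integer::sum);
    }

    /** Accumulated score; songs with no recorded events score zero. */
    int score(String user, String song) {
        return scores.getOrDefault(user, new HashMap<>()).getOrDefault(song, 0);
    }

    /** A song counts as liked once its positive hints outweigh the negative ones. */
    boolean likes(String user, String song) {
        return score(user, song) > 0;
    }
}
```

A song selection back end could then skip songs with negative scores, while the client keeps reporting events without requiring any extra action from the user.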

A second key observation is that, when prefer- 
ences for many users are all stored together in the 
infrastructure, there is a great opportunity to mine 
this data for cross-user information and to provide 
collaborative services [19]. We have implemented a 
simple collaborative filtering application for the DJ. 
By default, a user's preferences are regarded as pri- 
vate and are stored securely in the infrastructure, 
with no access allowed to third parties; however, 
we allow users to publicly export read-only access 
to their preferences to other users. Marking one's 
preferences as public allows one to share preferences 
between multiple users. For example, our imple- 
mentation allows a user to temporarily use some- 
one else's preferences for song selection (assuming, 
of course, that those preferences have been explic- 
itly marked as public). More interestingly, a user 
may combine the preferences of multiple other pub- 
lic users and use the result to drive the Jukebox 
client's song selection algorithm. This is a useful
way to accommodate multiple listeners with different
preferences; for example, in a shared environment
in which several students occupy the same office, a 
useful combination would be to play songs that are 
in the intersection of the students' sets of likable 
songs. 
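
The shared-office combination mentioned above amounts to a set intersection. A minimal Java sketch, assuming each listener's public preferences have already been reduced to a set of liked song titles (the SharedPlaylist class and likedByAll method are hypothetical names):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that combines several listeners' public preferences. */
class SharedPlaylist {
    /** Returns the songs that appear in every listener's set of liked songs. */
    static Set<String> likedByAll(List<Set<String>> likedSets) {
        if (likedSets.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> common = new TreeSet<>(likedSets.get(0));
        for (Set<String> liked : likedSets.subList(1, likedSets.size())) {
            common.retainAll(liked);  // keep only songs this listener also likes
        }
        return common;
    }
}
```

Other combinations, such as a union or a majority vote, would fit the same signature; intersection is simply the one the text gives as an example.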

The DJ extensions to the core Jukebox service resulted in
minimal changes to the existing codebase; rather, the
extensions were mostly encapsulated within the new DJ
component that was added to the Jukebox infrastructure. The
required changes to the existing codebase were limited to
modifications to the Jukebox client's song selection
algorithm to request a playlist from the DJ service, and to
the enhancement of the Jukebox client front end to send a
copy of all relevant events to the appropriate rating
storage service. We also augmented the Jukebox client GUI
to include controls that allow the user to explicitly
indicate preferences for specific songs (figure 4).

3 We actually used a distributed, persistent hash table to
keep track of user preferences. This hash table (described in
[9]) is partitioned and replicated across nodes in a dedicated
workstation cluster, and provides the DJ fault-tolerant access
to the [



4 Discussion 

In this section of the paper, we first present sev- 
eral lessons that we learned about using Java as 
a service construction language, and then we dis- 
cuss several limitations of the current Jukebox im- 
plementation. 

4.1 Java as a Service Construction Lan- 
guage 

We were surprised to find that the decision to use 
Java as a rapid prototyping tool met with mixed 
results. Certainly, Java's high-level programming 
model made for extremely rapid prototyping: the 
first version of the Jukebox service was built in 2 
days by a team of 3 students. Java's strong typing 
also encouraged modularity, which made it easier to 
extend and evolve the service several times: once 
to migrate from playing CDs in real-time towards 
serving as a shared MP3 repository, later to extend 
the service to add a security model, and a third time 
to transparently learn song preferences and to add 
support for collaborative filtering. In all three cases, 
strong typing helped assure the separation between 
client code (which should change rarely) and net- 
worked services (which may evolve frequently) that 
was a key ingredient to success. Also, the tight
coupling of RMI with Java, and the existence of the
Ninja SecureRMI infrastructure made distributed
programming less painful.

What we didn't anticipate is that there were some 
negative aspects to using Java and RMI. When you 
change the implementation of some relevant class 
on one RMI endpoint, to avoid class checksum errors
you must grab the new source code and recompile
on all other endpoints too. Thus, updates
to the service code require synchronized updates at
all RMI endpoints, which is an administrative
annoyance. Moreover, though we didn't realize it at
first, if we had used RMI for all of our external ser- 
vice interfaces, the situation would have been far 
worse: each upgrade to the Jukebox service would 
potentially have required the clients to be updated 
too, a terrible scenario for service evolution! Fortu- 
nately, we got lucky: most of our external interfaces 
that changed used HTTP, not RMI. Our interpre- 
tation (in retrospect) is that we should have done 
a better job of picking strongly-typed interfaces to 
the outside world (rather than having any untyped 
HTTP connections) and frozen these interfaces from 
the outset, but we didn't. We gained considerable 
leverage from the use of narrow, strongly-typed in- 
terfaces between internal Jukebox components, and 
if we were to re-implement the Ninja Jukebox, we 
would strive to do the same for all of our external 
interfaces as well. 

4.2 Limitations 

Our current prototype of the Jukebox service 
has a number of limitations. First, the Jukebox is 
not intended, in its current incarnation, as a wide-area
distributed service. Instead, we have focused
on providing service within a single organization. As 
an example, the MusicDirectory service is currently 
centralized, which means that it would quickly be- 
come a bottleneck if we moved to a wide-area usage 
scenario. We have also made the simplifying as- 
sumption that all nodes in the system are relatively 
close to each other (in terms of network latency and 
bandwidth), so that from the client's point of view 
all SoundSmiths are created equal. 

Although a wide-area Jukebox service would be 
limited by the capacity of the underlying network, 
with some more work we could extend Jukebox to 
address wide-area concerns. Two changes would be 
required: (1) the MusicDirectory service would have 
to become wide-area aware, using standard tech- 
niques such as replication, caching, and aggregation 
to distribute song listings around the world, and (2) 
we might want to replicate MP3's across the wide area,
using pre-fetching and caching to reduce the load on the
network. Neither of these changes is
conceptually difficult; we built the Jukebox because 
it was a service we wanted, and so we ignored these 
aspects. 

A second important limitation is that our cur- 
rent implementation is very naive about multime- 
dia operations. The MP3 data is transmitted over 
an HTTP connection, and thus inherits all of the
problems of TCP for multimedia data: no quality- 
of-service guarantees, potentially high latency, un- 
wanted buffering, and so on. There is also no rate 
limiting; we merely blast as fast as we can, which 
runs the risk of overloading the network. Nor are 
our clients particularly sophisticated about multi- 
media issues: our MP3 player doesn't do real-time 
scheduling, so during heavy paging we occasionally 
experience playback glitches. Nonetheless, these 
issues are largely orthogonal to our research; in- 
stead, we focused on testing the hypothesis that we 
can rapidly build a highly evolvable service if we 
carefully use component architectures and strongly 
typed interfaces, and thanks to this extensibility we 
believe a future version of the Ninja Jukebox could 
easily include better multimedia delivery technol- 
ogy. 

Thirdly, our current prototype has poor perfor- 
mance for the Java security operations. Right now, 
we are using a pure-Java cryptographic library, with 
no JIT, and as a result the public-key operations are 
very CPU-intensive: the initial SecureRMI hand- 
shake currently takes about 4 seconds to complete. 
Of course, these numbers could be dramatically re- 
duced by any of many techniques (native code, pre- 
computation, caching, session-reuse, etc.), but so far 
the performance impact has been relatively innocu- 
ous. 



5 Related Work 

Keeping collections of audio files in a net- 
accessible way is obviously not a new idea. The 
simplest way to publish one's music collection to 
the net is just to make it accessible as a WWW or 
FTP archive. Many people do this today, but the 
utility of unconnected collections of audio is low. In 
order to make these collections more useful, dedi- 
cated MP3 search engines such as mp3.lycos.com 
have appeared. These search engines try to be your 
"one-stop shopping" for MP3's, by telling you where 
on the Internet you can find your favorite pirated 
songs. 4 More recently, commercial jukebox products
have become available that allow you to organize
and play locally-stored MP3's, but these products
typically do not permit sharing between users,
nor do they offer collaborative or interactive features.

4 Lycos itself, of course, does not illegally publish
copyrighted material.

This simplest kind of jukebox system is missing 
a number of benefits that the Ninja Jukebox pro- 
vides. Simple directories of MP3 files offer no co- 
hesive framework for security-related features such 
as authenticated or pay-per-use access. In addition, 
our component architecture allows the SoundSmiths 
to be active and easily updatable participants in 
the transmission of the audio, as opposed to merely 
serving a static file. This allows features such as 
transparent format conversion (.wav files on file systems,
raw audio on CD-ROM drives, or MP3 files
on file systems) and support for multiple transport
mechanisms (streamed audio over HTTP, or VAT 
audio over a multicast IP channel). 

Another approach has recently come from 
SHOUTcast [11]. SHOUTcast is an "Internet radio"
system that allows a site to serve an audio
stream that can be picked up by multiple clients. 
The clients have to listen to what the SHOUTcast 
servers decide to play: they have no way to interact 
with the servers. Although each SHOUTcast server 
offers the same real-time audio stream to each of 
its clients, and though its name would imply some- 
thing more clever, its underlying technology is just 
multiple simultaneous unicasts of the same data. 
SHOUTcast servers communicate with one or more 
central databases in order to register the names 
of the programs they are currently "broadcasting". 
These databases can be queried by client programs 
(like MP3Spy [13]) to allow users to choose what
channels they would like to hear. 

The largest difference between SHOUTcast and 
our work is that our goal was to provide a commu- 
nal, collaborative, interactive jukebox, as opposed 
to a passive Internet radio station. That having 
been said, however, it would be possible for a Sound- 
Smith to transmit any particular song over a true 
multicast channel [7]. SHOUTcast servers also do
no user authentication; one might indeed imagine 
that an Internet radio service would have no need for 
such a thing. However, given the broad view of "au- 
thentication" taken by the Ninja Jukebox, one could 
see that implementing, for example, subscription- 
based access or pay-per-use access, could add value 
(better quality of service, for example) even to a 
non-interactive service like Internet radio. 

A related approach is the Interactive Multimedia 
Jukebox [1, 2], a system that allows one to add a 
measure of interactive preference feedback to traditional
broadcast paradigms.

More recently, the SDMI project is starting to 
tackle the issues associated with copyright control 
and rights management, using a combination of 
tamperproof hardware (or software!) on the client 
end as well as watermarking and other technologies 
[14]. We view SDMI as largely orthogonal to our 
work: we have focused on building a music delivery 
service, rather than on what is done after the music 
has been delivered. 

There have been a number of projects involved 
in the delivery of audio and/or video over digital 
networks (for example, [5]); these projects mainly
concern themselves with the technology of media 
delivery. In contrast, we have left that issue largely 
unaddressed, as it is orthogonal to our own goals; 
we were more interested in the mechanisms of the 
service, rather than the mechanisms of serving. 



6 Conclusions 

In this paper, we demonstrated that Java is a con- 
venient language for the construction of infrastruc- 
tural services, although there are several pitfalls and 
hurdles (such as performance, vagaries about the in- 
ternals of its RMI facilities, etc.) that need to be 
addressed or avoided in order to successfully build 
such services. We also partially validated our hy- 
pothesis that infrastructural services which explic- 
itly expose a strongly typed, programmatic API (as 
opposed to an unstructured interface designed only 
for humans) are conducive to the construction of 
complicated applications. Finally, we demonstrated 
that a distributed component architecture enabled 
the rapid development of an infrastructural Jukebox 
service, and that through the careful decomposition 
of the service into components and deliberate atten- 
tion given to the design of the service's internal and 
external interfaces, we were able to smoothly evolve 
the first generation Jukebox into a richer and more
mature service.

Abstract. Integrating semantically heterogeneous databases requires rich data
models to homogenize disparate distributed entities with relationships and to access
them through consistent views using high level query languages. In this paper, we
first survey the IRO-DB system, which federates object and relational databases
around the ODMG data model. Then, we point out some technical issues to extend
IRO-DB to support multimedia databases on the Web. We propose to make it evolve
towards a three-tiered architecture including local data sources with adapters to
export objects, a mediator to integrate the various data sources, and an [...]
user interface supported by a Web browser. We show through an example that new
heuristics and strategies for distributed query processing and optimization have to
be incorporated.

Keywords Interoperable database. Federated database. Object-oriented database. 
Remote data access. Schema integration. Multimedia. Web. Query processing. 

1. Introduction

Object-oriented multidatabase systems (also referred to as federated databases or
heterogeneous databases) represent the confluence of [...]
science and technology [1], among them object-orientation [2], distributed databases
and interoperability. Recently, the Internet has become the major vehicle in the networking
industry for information access and dissemination. The Web as a service on top of the
Internet or Intranets focuses on transparent navigation and [...]
oriented information access. Thus, today there is a need to integrate the object-oriented
multidatabase technology within multimedia Web-based systems, both for Intranet
applications and Internet services. This paper discusses the integration of multimedia
and Web techniques within the IRO-DB federated database [...].

While much of the early work in federated [...]
technology [3], some projects have been developed at the beginning of the 90's based
on [...]. Pegasus [4] was one of the first [...]
around a global object model to which local objects (e.g., tables from a relational DB)
are mapped. The global model is close to the object-relational one and the query
language is SQL3+, an adapted version of SQL3. Started in 1993, the [...]
project [5] has developed a similar approach, but based on the ODMG standard [6].
IRO-DB supports an interactive schema [...]
views of the federated database, and to automatically generate mappings to exported
[...]. IRO-DB also focuses on relationship traversal and complex [...] handling
through collection support. This paper first gives an overview of the IRO-DB system
architecture and describes some of the main components of the system.
Recently a new generation of heterogeneous database systems [...]
Manifold of AT&T [9], and Disco from INRIA [10] [...]
to support multimedia objects and Web technology.

This paper first describes the IRO-DB [...].
Developed by a consortium of European partners, IRO-DB [...]
application was recently demonstrated. Thus, [...]
support multimedia objects, such [...] Web-enabled.
In the next section, we survey the IRO-DB architecture [...].
In the third section, we try to isolate the [...] extending it to
support Web technology [...]
architecture and discuss some query processing issues. [...] we summarize
the contributions of this paper and our future plans.

2. The IRO-DB Project 

IRO-DB is an object-oriented federated database system. A [...]
operational. It interconnects a relational system [...] and [...]
DBMS: O2, Matisse and Ontos. In this section, we briefly [...]
features and some key components.
2.1 Project Overview 

[...] architecture is to use the ODMG standard [...] the ODL definition language
and [...] to federate various object-oriented and relational data sources. The IRO-DB
[...] divided into three layers, thus facilitating the [...] in several research centers.
The local [...] the ODMG standard; the communication layer efficiently transfers OQL
requests and [...] collections of objects; the interoperable layer provides schema [...],
security management, transaction management, object management, as well as a global query
[...] integrated view). A local schema is a local database schema, as usual. An export
schema describes in ODMG terms the subset of a database that a local system allows to
be accessed by cooperating systems. IRO-DB does not support a global unique
integrated schema, but allows application administrators to define integrated views.
An integrated view consists of a set of derived classes with relationships together with
mappings to the underlying export schemas.
2.2 System Layer Descriptions 

The architecture of the system is organized in three layers of components as [...].
At the interoperable layer, object definition facilities stand for specifying integrated
schemas, which are integrated views of the federated databases. We fully support the
ODL definition language, with many-to-many relationships. An interactive tool
called the Integrator Workbench (IW) is offered to help the database administrator in
designing integrated views. Views and export schemas are stored in a dictionary.
Object manipulation facilities include an embedding of OQL in the OML/C++ user
language and modules to decompose global queries into local ones (global query
processor) and to control global transactions (global transaction management). Object
definition and manipulation facilities are built upon the integrated object manager.

The communication layer implements object-oriented Remote Data Access (OO-RDA)
services through the Remote Object Access (ROA) modules, both on clients and
servers. Integration of the ROA protocol within the interoperable layer is provided
by the object manager, which is able to manipulate collections of objects. [...]
it is possible to invoke OQL/CLI primitives to retrieve collections [...]
the local site. OQL/CLI primitives include connection and disconnection, preparation
and execution of OQL queries with transfer of results through collections of objects,
plus some specific primitives to import at the interoperable layer the exported ODMG
schemas, as well as a primitive to perform remote method invocation.
The local layer is composed of Local Database Adapters (LDA). A local database
adapter provides functionalities to make a local system able to answer OQL queries on
an abstraction of a local schema in terms of an ODMG schema (export schema). As an
export schema only describes locally implemented types, only locally available
functionalities are available for querying through OQL. That means, for example, that
methods and access paths cannot be invoked with a simple LDA that does not know
how to export complex ODMG schemas. In that case, only flat objects without
relationships are handled at the local layer for relational systems. Of course, if the local
system supports the full ODMG model, all syntactically correct ODMG queries are
acceptable.

2.3 Main System Components 
The Integrator Workbench 

The integrator workbench generates from a graphic interface the ODL description of an 
integrated view, plus some mapping definitions using OQL and C++ method profiles. 
Integration is done one exported schema after the other. At first, two exported schemas 
are displayed using a graphical view of the object model. Classes are represented as 
nodes. Two types of arcs are used to represent relationships and generalizations 
linking class nodes together. Attributes and methods are directly linked to the class 
they belong to using aggregation arcs. A naive schema integration is first performed by
union of the two graphs. Then, integration is performed under the direction of the
database administrator, who specifies similarities between the export schemas.

For example, with the CIM test-bed application that has been developed to
demonstrate the system [11], we integrate two databases as represented in figure 2. Site
1 manages parts and manufacturers. Site 2 manages parts. The G::PART integrated
view is a class with a one-to-many relationship referencing manufacturers, i.e.,
G::MANUF. The integrator workbench generates the view definition in ODL and the 
mapping to the import schemas given in figure 3. 







Fig. 2. Example of integrated databases. 
Interface PART
(extent parts
 keys part_id) {
    attribute String part_id;
    attribute Date upd_date;
    attribute String description;
    relationship Set<MANUF> manufs
        inverse MANUF::parts; }

Mapping PART {
    origins sorig, iorig;
    def_ext select PART(sorig: s_inst, iorig: i_inst)
            from s_inst in s_parts, i_inst in i_prts
            where s_inst.partid = i_inst.prt_id;
    def_att part_id as this.sorig.partid;
    def_att upd_date as this.sorig.upddate;
    def_att description as this.sorig.description;
    def_rel manufs as
            select me
            from me in manufs
            where (this.sorig = me.sorig.part) or (this.iorig = me.iorig.part); }
Fig. 3. Definition of a derived class. 
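The mapping of figure 3 can be read as a join over the two imported extents followed by attribute derivation. The Python sketch below is an illustration only, not IRO-DB code. It assumes that local objects are plain dictionaries and that prt_id is unique at site 2; the Part class, the derive_parts function and the sample values are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """Derived object keeping a reference to each of its two origins."""
    sorig: dict  # origin object from s_parts (site 1)
    iorig: dict  # origin object from i_prts (site 2)

    # Derived attributes are computed from the site 1 origin, as in the mapping.
    @property
    def part_id(self):
        return self.sorig["partid"]

    @property
    def upd_date(self):
        return self.sorig["upddate"]

    @property
    def description(self):
        return self.sorig["description"]


def derive_parts(s_parts, i_prts):
    """Build the derived extent: one Part per pair with matching identifiers."""
    by_id = {i["prt_id"]: i for i in i_prts}  # assumes prt_id is unique
    return [Part(s, by_id[s["partid"]]) for s in s_parts if s["partid"] in by_id]


# Invented sample data for illustration.
S_PARTS = [
    {"partid": "p1", "upddate": "1997-01-01", "description": "bolt"},
    {"partid": "p2", "upddate": "1997-02-01", "description": "nut"},
]
I_PRTS = [{"prt_id": "p1"}]
PARTS = derive_parts(S_PARTS, I_PRTS)
```

Only parts known at both sites produce a derived object, which mirrors the equality condition in the def_ext clause.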
The Global Query Processor 

The goal of the Global Query Processor is to process queries against an integrated
schema. It is responsible for decomposing the query in order to identify the necessary
object transfers and consequently the sub-queries that should be sent to local databases. 
It is composed of three components described below. Their complete description is 
presented in [12]. 

• Translator. It manages OQL queries expressed against derived classes and translates
them into equivalent OQL queries expressed against imported schemas. The intuitive
principle of translating a query expressed on derived classes consists in replacing all
derived class extent names by their mapping definition. The translation process uses the
derivation specification of each class available in the repository. If there are several 
layers of derived classes, the query is translated recursively, until it refers to imported 
classes only. 

• Optimizer. The task of the optimizer is to improve query processing
performance. To do so, it applies a set of rules to minimize the object transfers and the
communication cost between the IRO-DB client and the local databases. At first, a cost
model was designed for the optimizer using a calibrating approach [13]. Although
some ODBMSs were calibrated, the cost model was ultimately not used and heuristics
were applied instead. First, rules are applied to flatten the query as much as possible. Next,
selections (i.e., restrictions and projections) are moved down the tree. Finally, caching 
rules are used to avoid transferring objects already in the client cache. 

• Decomposer. The query decomposer identifies sub-trees which refer to mono-site
queries and generates the corresponding OQL sub-queries. The query decomposer
generates an execution plan composed of two distinct parts: (i) the set of sub-queries
executed on the local DBMSs, (ii) the synthesis query corresponding to the part of the
query tree to evaluate globally.
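The translation and decomposition steps described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the IRO-DB implementation: queries are reduced to the list of extents they reference, the mappings are assumed to be acyclic, and the names MAPPINGS, SITE_OF, G::CATALOG and s_manufs are hypothetical (only G::PART, s_parts and i_prts come from the example of figure 3).

```python
# Hypothetical repository: derived class -> extents its mapping is defined over.
MAPPINGS = {
    "G::PART": ["s_parts", "i_prts"],
    "G::CATALOG": ["G::PART", "s_manufs"],  # invented second layer of derivation
}

# Hypothetical placement of imported extents on local sites.
SITE_OF = {"s_parts": "site1", "s_manufs": "site1", "i_prts": "site2"}


def translate(extents, mappings):
    """Recursively replace derived extents until only imported ones remain."""
    result = []
    for name in extents:
        if name in mappings:
            # Derived class: expand its mapping, possibly through several layers.
            expanded = translate(mappings[name], mappings)
        else:
            # Imported class: keep as is.
            expanded = [name]
        for item in expanded:
            if item not in result:
                result.append(item)
    return result


def decompose(imported, site_of):
    """Group imported extents by site, one mono-site sub-query per group."""
    plan = {}
    for name in imported:
        plan.setdefault(site_of[name], []).append(name)
    return plan


IMPORTED = translate(["G::CATALOG"], MAPPINGS)
PLAN = decompose(IMPORTED, SITE_OF)
```

A real translator rewrites the whole query tree rather than a list of names, and the synthesis query evaluated at the interoperable layer is left out of this sketch.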

The Global Object Manager
The Global Object Manager ensures virtual object management for the imported and
derived objects. It allows object creation and access and guarantees their identity at the
interoperable layer. The Object Manager imports objects from external database
systems and combines them into a uniform representation by creating surrogate objects.
In IRO-DB, the management of imported classes differs from the management of
derived classes. The instantiation of an imported class is not necessary for the evaluation
of an OQL query. On the contrary, for all queries that return derived objects as opposed
to structured values, it is mandatory to create derived objects since they do not exist in
other remote databases. The instantiation of a virtual class is done by its constructor
method.

The Remote Object Access

The Remote Object Access component implements the CLI interface for OQL
queries over TCP/IP. It provides a set of commands to manage contexts, connections,
export schemas, statements, results and transactions. We have modified the standard
CLI interface to handle OQL queries in place of SQL, but also and mainly to handle
collections of objects in place of tuples. Thus, primitives to handle object descriptions
have been added. Iterators have been introduced and standard OML primitives to cross
a collection through an iterator have been added.
An ad-hoc protocol derived from FAP has been implemented to support the transfer
primitives, using XDR for encoding. Commands are bundled into messages by the
ROA and unbundled on the server by the LOA. An object is transferred as a [...] of
values with a tag giving the type of each attribute. Similar objects are organized in
collections (list, set, bag and array) both on the server site and the client site, according
to the query result type. Collections of similar objects are packed in pages for
improving the transfer rate.
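The idea of sending objects as tagged values grouped in pages can be illustrated with a short Python sketch. It is only an assumption-laden illustration: it does not implement XDR or the FAP-derived protocol, it uses Python type names as tags, and the function names encode_object and pack_pages as well as the sample data are hypothetical.

```python
def encode_object(obj):
    """Encode one object as a list of (type_tag, value) pairs, one per attribute."""
    return [(type(value).__name__, value) for value in obj.values()]


def pack_pages(objects, page_size):
    """Pack a collection of similar objects into pages of at most page_size objects."""
    encoded = [encode_object(o) for o in objects]
    return [encoded[i:i + page_size] for i in range(0, len(encoded), page_size)]


# Invented sample collection of three similar objects.
OBJECTS = [
    {"partid": "p1", "qty": 3},
    {"partid": "p2", "qty": 5},
    {"partid": "p3", "qty": 7},
]
PAGES = pack_pages(OBJECTS, page_size=2)
```

Sending one page per message instead of one object per message reduces the number of round trips, which is the motivation given for paging.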

[...] the local database and to send back results through the Local Object Access
(LOA). It accesses the local data dictionary containing the definition of the export
schema referenced by the query. Then, it translates the OQL query into local
queries. On top of relational systems, complex OQL queries are generally translated
into several SQL queries. Assembling the results in collections is one of the most
difficult tasks. To populate the dictionary from a local database schema, each LDA
provides an export schema builder. This is a tool reading the relevant part of the local
database schema and building an ODMG counterpart.

Although we designed a generic adapter for ODMG databases, we found
that each ODBMS interface is specific. Thus, except for the common libraries, which are
those of the relational LDA, the object LDAs have no common modules. The O2 LDA
is mainly implementing a dynamic OQL on top of O2. Each OQL query is translated
into a selection of object identifiers and then into some [...]
attributes or methods. The MATISSE LDA is very specific; it interprets OQL queries
using the MATISSE C API. The ONTOS LDA is quite similar, but with a different
target language.

3. Extending IRO-DB to Support Web Multimedia Databases 

In this section, we discuss how we plan to extend IRO-DB to the Web, as a mediator
supporting object-relational queries on integrated views of multiple data sources, the
queries being issued from Web browsers. Due to the multimedia nature of the Web,
multimedia servers also have to be supported. As an example, we discuss the case of an
image server.

3.1 Integrating the Web 

Integrating a federated database system with the Web first means accessing it through
browsers and Http servers. Web client access is currently dominated by browser
software from Netscape and Microsoft (Explorer) providing graphical user interfaces.
One major characteristic of such navigators is that their functionality can be easily
extended and adapted to specific application needs. There are at least four approaches
to extend a browser: (1) Scripts written in a script language (e.g., JavaScript), included
in Html pages and interpreted by the browser, (2) Applets written in Java, which are small
applications whose intermediate code is included in Html documents to be interpreted
by browsers, (3) Plug-ins written in any language, compiled and loaded on demand,
then linked to extensible browsers, (4) ActiveX controls, a technology provided by
Microsoft, which are persistent modules loaded from the server and integrated into
Windows browsers through the Microsoft DCOM technology.

Among these approaches, it is hard to predict which will become dominant in the
future. However, it already seems that scripts are limited to data entry, plug-ins are too
heavy to manage, and ActiveX controls are Microsoft dependent. Thus, Java applets
seem to be the best technology for integrating IRO-DB functionality with browsers.

On the server side, the interest of DBMS vendors is to provide suitable frameworks
to their customers, which allow for the easy creation or adoption of information
services for the Web. Currently, there are four approaches to develop Web servers for
database applications: (1) CGI (Common Gateway Interface) is the mechanism for a
user to invoke a program that sends Html formatted output data back to the client. The
program can invoke a database manager, e.g., through an ODBC SQL interface or a
specific one. For example, the Oracle Web product is based on a CGI interface to
process queries and return formatted results. (2) Server APIs are specific interfaces to
the server. The Netscape API includes a Database Connectivity Library for direct SQL
connectivity to most relational databases, including Oracle, Informix, Sybase and
ODBC for DBMS independent connections. The Microsoft API includes dbWeb, an
application able to process queries from a Web browser and to handle the
communication between the browser, an ODBC data source, and a Web server to
display the results on an outgoing Web page. (3) Java-based server APIs use the Java
programming language to create applets on the client side that run programs on the
server side. The so-called JDBC (Java Database Connectivity) API proposed by
SunSoft makes available a standard interface to any Java applet or application. JDBC is
based on the X/OPEN SQL Call Level Interface (CLI), the basis of ODBC. JDBC is
available on top of ODBC, which means that any relational database is accessible
through JDBC. The protocol used to access a relational database from JDBC is some
variation of the ISO Remote Data Access (RDA) protocol, depending on the
database server. (4) ActiveX server tools are pushed by Microsoft. They are written in
Visual Basic or Visual C++ and run under the control of ActiveX browser objects.

Integrating the Web also means being able to access multiple data sources from the
interoperable layer runtime, hereafter referred to as the mediator. A mediator aiming at
database integration should support most database [...]
Web technology. It should also be able to integrate loosely formatted files such as
Html files. As introduced above, the JDBC API, which can be plugged on any type of
formatted data, seems to be currently the best approach for relational data sources in
the Web context. There are plans to support object-relational and object databases.
However, JDBC does not yet support multimedia data or semi-structured files. Thus,
it has to be extended with specific data types, such as images, texts, Html documents, etc.
We further detail this point in the communication layer section.

Finally, for mediating between various databases on the Web, a three-tiered
architecture, as represented in figure 4, seems to be well suited. [...]
the mediator home page. Through Java applets, global queries will be formed and
sent to the mediator using HTTP-CGI or Java-to-Java protocols (e.g., RMI). The
mediator will then decompose the query into local sub-queries sent to local sources
through JDBC for formatted databases, and through some extensions of it for
multimedia data.




Fig.4.A 



3.2 Local Support for Multimedia Sources



The ODL and OQL languages are the local description and manipulation languages
provided by LDAs in IRO-DB. ODL is similar to CORBA IDL, but with the notion of
extents and relationships. Is that sufficient to describe and manipulate multimedia
objects? Let us consider for example an Html file with references to images. Html
documents are generally loosely formatted. They are often retrieved using keywords.
Keywords can be modeled by a method returning a ranked [...]
keywords. References (i.e., HREF) to images can be modeled as
relationships. Images are generally in Graphics Interchange Format (GIF) or Joint
Photographic Experts Group (JPEG) format. The only available function on such data
types is display. Thus, in that case, ODL is sufficient to describe an image.
However, in the case of an image DBMS, content-based queries are possible. In
many image systems, pick lists are available for query support. Pick lists give values
for typical features, such as colors, textures, shapes, etc. [14]. Other operations, such as
rotate, clip (to select an image region), and overlay (to check if two images intersect), are
possible. Thus, a description of an image can be specified using ODL as given in figure
5.

Query processing is more difficult, as queries in multimedia servers are not exact 
match queries. Often, a search expression involves uncertainty and fuzziness, with 
comparison operators like "SIMILAR TO" in conditions. Such predicates have to be 
defined as operations at the type level in ODL. Sophisticated techniques have been 
developed for similarity measurement ; they generally compute a measure from 0 to 1 . 
Returning this measure to the mediator is not sufficient as the query processor does not 
know how to interpret and combine such numbers. Visual query languages such as QBIC
support such fuzziness [7], but mediators do not. This is an open problem.

Interface image
{ attribute
    photoid int ;
    content array[1024, 1024] int ;
  operation
    array[10] picklist (image) ;
    image rotate (image, angle) ;
    image clip (region) ;
    boolean overlay (image, image) ;
    integer similar (image, image) ; }



A typical query on a database federating employees with their pictures defined as 
images is then : 

SELECT Clip(Rotate(I, 90), "area")

FROM Employees E, I in E.Pictures

WHERE E.age() > 50 and I.similar("Bandit") > 0.8 ;

It retrieves all employees older than 50 who look like the bandit picture. Each
resulting picture is rotated by 90° and cut using the clip function to fit the screen.
Processing such a query, assuming that employees are handled by some object-relational
database server and that images are managed by a multimedia server, is a difficult task
that we further investigate below. 

3.3 Extending the Communication Layer 

The IRO-DB communication layer provides facilities to communicate from the mediator
to the local adapters. With Java, JDBC can be used to query and update relational
databases as explained above. However, JDBC has several shortcomings. First, it is
based on SQL and does not support objects yet. An OQL version seems to be in
preparation. Second, it does not support global transactions. Integration with transactional
monitors is hard to develop. Thus, we plan to replace the IRO-DB communication
layer by JDBC, into which some extensions will be incorporated. For multimedia servers,
it is not obvious that JDBC will be the right choice. We might then use specific
communication protocols based, for example, on RMI, the Java remote method
invocation protocol.

3.4 Extending the Interoperable Layer 

The IRO-DB interoperable layer includes design help components and query 
processing components. Both have to be extended as briefly indicated below. 
3.4.1 The Integrator Workbench

The Integrator Workbench should support new data types and new mapping functions
from one data type to another. For example, integrating two sets of images, one in GIF,
the other in JPEG, requires the knowledge of these data types and of a conversion
function from JPEG to GIF. This seems to be feasible by integrating libraries of
components in the Workbench. More difficult will be the integration of semi-structured
data types, such as Html files. Semi-structured type templates should be abstracted
from files and integrated in the workbench. The less specific the type templates are,
the less feasible integration will be.
3.4.2 The Query Processor 

In IRO-DB, the query processor does not integrate cost-based optimization, as stated
above. It just applies simple heuristics such as pushing selections first. This strategy could give
bad results in a multimedia context. For example, using the query given in the example
above, the optimizer might generate two sub-queries, respectively sent to the employee
database and to the picture database, as follows :
(Q1) SELECT E.Images
FROM Employees E
WHERE E.age() > 50
(Q2) SELECT I, res = Clip(Rotate(I, 90), "area")
FROM I in Pictures
WHERE I.similar("Bandit") > 0.8 ;
and a synthesis query run on the mediator :
(Q3) SELECT res

FROM E in Q1, I in Q2

WHERE I = E.Images ;

This will be a very bad plan, as every image will be compared to the bandit on the
multimedia site and many will be transferred on the net [...]
requires clever heuristics or even a full object cost model [15]. Some operations (e.g.,
clip) are only possible on certain sites. Thus, algorithms that use the source descriptions
to generate executable query plans are required [16]. Also, [...]
expressions and knowing operation costs is not an obvious task. A general framework
for this problem is proposed in [17].
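To make the cost argument concrete, the Python sketch below counts similarity evaluations and transferred images for the plan above and for one possible alternative in which the employee selection is evaluated first and only the referenced images are compared. The alternative is an assumption chosen for illustration, not a plan proposed in this paper; the function names, the sample data and the precomputed score field standing in for the similar operation are all hypothetical.

```python
def naive_plan(employees, pictures, similar, age_limit=50, threshold=0.8):
    """Q1 and Q2 run independently, then Q3 joins their results at the mediator."""
    wanted = {e["image"] for e in employees if e["age"] > age_limit}  # Q1
    comparisons = 0
    matching = []
    for pic in pictures:  # Q2: every image is compared to the bandit
        comparisons += 1
        if similar(pic) > threshold:
            matching.append(pic)
    transferred = len(matching)  # all matching images travel to the mediator
    result = [p for p in matching if p["id"] in wanted]  # Q3
    return result, comparisons, transferred


def semijoin_plan(employees, pictures, similar, age_limit=50, threshold=0.8):
    """Evaluate the employee selection first, then compare only referenced images."""
    wanted = {e["image"] for e in employees if e["age"] > age_limit}
    comparisons = 0
    result = []
    for pic in pictures:
        if pic["id"] in wanted:  # image identifiers shipped to the multimedia site
            comparisons += 1
            if similar(pic) > threshold:
                result.append(pic)
    return result, comparisons, len(result)


def score(pic):
    """Stand-in for the similar operation: returns a precomputed measure."""
    return pic["score"]


# Invented sample data.
EMPLOYEES = [
    {"age": 60, "image": 1},
    {"age": 30, "image": 2},
    {"age": 55, "image": 3},
]
PICTURES = [
    {"id": 1, "score": 0.9},
    {"id": 2, "score": 0.95},
    {"id": 3, "score": 0.5},
    {"id": 4, "score": 0.85},
]
```

On this sample both plans return the same picture, but the first one evaluates the similarity function on all four images and transfers three, while the second evaluates it on two images and transfers one. Whether the second plan is actually cheaper depends on the cost of shipping identifiers between sites, which is exactly what a cost model would have to estimate.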
3.4.3 The Object Manager

The IRO-DB object manager only handles the basic data types of OQL. This is
insufficient on the Web, with multimedia objects. One question is whether it should handle all
data types imported from local databases. If so, the object manager has to be
extensible. Every manipulated data type should be integrated at the level of the object
manager. Thus, when exporting a new type, the LDA will have to supply it to the
object manager of the mediator. Unless a portable and transferable language is used
everywhere (e.g., Java), this is impracticable. One (poor) solution is to restrict the set
of available data types. For example, BLOBs could be used to transfer images. Another
(rich) solution is to develop all abstract data types in Java and to download the
bytecode where needed. Sending methods should then be taken into account by the
query optimizer, as sending bytecode might be costly.

4. Conclusion 

In this paper, we describe the IRO-DB federated object-oriented database system 
architecture. IRO-DB federates relational and object-oriented databases through an 
object-oriented data model with an associated object query language derived from the 
ODMG proposal. The overall aim of the project — the provision of building blocks for 
federated database management — has been recognized worldwide as one of the most 
important challenges in database research and development at the beginning of the 90's. 
The project develops suitable object-oriented generalizations of the SQL Access Group 
protocols extended with object-oriented features for the exchange of complex objects. 
It takes into account relevant industrial data interchange standards or proposals, such as 
the OMG architecture and the ODMG model. IRO-DB is a joint effort in Europe, 
which integrates components developed by various partners. The system is currently 
operational and a demonstrator CIM application is available. 

Our goal is now to extend IRO-DB to make it Web-enabled and to support
multimedia data sources. Through this paper, we isolate some of the difficulties in doing
so. It appears that rewriting the system in Java and using Java packages to support
multimedia types and to access databases will help solve many problems. However,
it is insufficient. As demonstrated, query optimization has to be reconsidered. User
interface objects have to be developed as applets and template data types have to be
plugged in to support semi-structured files. Further, the management of metadata has to be
improved to support a large number of data sources, with quick structural changes. 
However, we do not believe that developing a new data model is necessary as the 
ODMG data model is sufficient to support references as relationships and complex 
objects as collections. SQL3 provides the same features. We hope that these two 
models will converge to a unique one sufficient for most Intranet/Internet federated 
database applications. 
Media Player Plugin's Properties
t II: Netscape

Media Player Plugin's Methods

i supports several methods. We've built a demo jukebox to show their usage. W< 
all, Normal, and Large. Here is the HTML form: 

' Pause " NAME="playOrPause" O nClick="handlePlay0rPauseClick ( ) » £ 
' Hide Controls " NAME= " controls " onClick="handleControlsOnOf f CI 
• Small " NAME=" small" onClick="changeSize (1) " STYLE= " font - f ami 1> 

■ Normal " NAME="normal" onClick="changeSize (0) " STYLE= " font - f ami 

■ Large " NAME="large" onClick="changeSize (2) " STYLE=" font- f ami] 



iy button. Every clicking of this button carries out the proper command and swit 
:tion is the event handler for this button: 



2diaPlayer . GetPlayState ( ) ; 
ay ( ) ; 

ayOrPause. value = " Pause " ; 



1) { 
ay ( ) ; 